Abstract


It has long been conjectured that hypotheses spaces suitable for data that is compositional in 
nature, such as text or images, may be more efficiently represented with deep hierarchical networks 
than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical 
justifications to date are limited. In particular, they do not account for the locality, sharing and 
pooling constructs of convolutional networks, the most successful deep learning architecture to 
date. In this work we derive a deep network architecture based on arithmetic circuits that inherently 
employs locality, sharing and pooling. An equivalence between the networks and hierarchical 
tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) 
decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using 
tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions 
that can be implemented by a deep network of polynomial size, require exponential size in order to 
be realized (or even approximated) by a shallow network. Since log-space computation transforms 
our networks into SimNets, the result applies directly to a deep learning architecture demonstrating 
promising empirical performance. The construction and theory developed in this paper shed new 
light on various practices and ideas employed by the deep learning community. 

Keywords: Deep Learning, Expressive Power, Arithmetic Circuits, Tensor Decompositions 

1. Introduction 

The expressive power of neural networks is achieved through depth. There is mounting empirical 
evidence that for a given budget of resources (e.g. neurons), the deeper one goes, the better the 
eventual performance will be. However, existing theoretical arguments that support this empirical 
finding are limited. There have been many attempts to theoretically analyze function spaces
generated by network architectures, and their dependency on network depth and size. The prominent
approach for justifying the power of depth is to show that deep networks can efficiently express
functions that would require shallow networks to have super-polynomial size. We refer to such
scenarios as instances of depth efficiency. Unfortunately, existing results dealing with depth
efficiency (e.g. Håstad (1986); Håstad and Goldmann (1991); Delalleau and Bengio (2011); Martens
and Medabalimi (2014)) typically apply to specific network architectures that do not resemble ones
commonly used in practice. In particular, none of these results apply to convolutional networks
(LeCun and Bengio (1995)), which represent the most empirically successful and widely used deep
learning architecture to date. A further limitation of current results is that they merely show
existence of depth efficiency (i.e. of functions that are efficiently realizable with a certain depth but
cannot be efficiently realized with shallower depths), without providing any information as to how
frequent this property is. These shortcomings of current theory are the ones that motivated our work.
The architectural features that specialize convolutional networks compared to classic feed-forward
fully-connected networks are threefold. The first feature, locality, refers to the connection
of a neuron only to neighboring neurons in the preceding layer, as opposed to having the entire 
layer drive it. In the context of image processing (the most common application of convolutional 
networks), locality is believed to reflect the inherent compositional structure of data - the closer 
pixels are in an image, the more likely they are to be correlated. The second architectural feature of 
convolutional networks is sharing, which means that different neurons in the same layer, connected 
to different neighborhoods in the preceding layer, share the same weights. Sharing, which together 
with locality gives rise to convolution, is motivated by the fact that in natural images, the semantic 
meaning of a pattern often does not depend on its location (i.e. two identical patterns appearing in 
different locations of an image often convey the same semantic content). Finally, the third
architectural idea of convolutional networks is pooling, which is essentially an operator that decimates
layers, replacing neural activations in a spatial window by a single value (e.g. their maximum or 
average). In the context of images, pooling induces invariance to translations (which often do not 
affect semantic content), and in addition is believed to create a hierarchy of abstraction in the
patterns neurons respond to. The three architectural elements of locality, sharing and pooling, which
have facilitated the great success of convolutional networks, are all lacking in existing theoretical 
studies of depth efficiency. 

In this paper we introduce a convolutional arithmetic circuit architecture that incorporates
locality, sharing and pooling. Arithmetic circuits (also known as Sum-Product Networks, Poon and
Domingos (2011)) are networks with two types of nodes: sum nodes, which compute a weighted 
sum of their inputs, and product nodes, computing the product of their inputs. We use sum nodes to 
implement convolutions (locality with sharing), and product nodes to realize pooling. The models 
we arrive at may be viewed as convolutional networks with product pooling and linear point-wise
activation. They are attractive on three accounts. First, as discussed in app. E, convolutional arithmetic
circuits are equivalent to SimNets, a new deep learning architecture that has recently demonstrated 
promising empirical results on various image recognition benchmarks (Cohen et al. (2016)).
Second, as we show in sec. 3, convolutional arithmetic circuits are realizations of hierarchical tensor
decompositions (see Hackbusch (2012)), opening the door to various mathematical and algorithmic 
tools for their analysis and implementation. Third, the depth efficiency of convolutional arithmetic 
circuits, which we analyze in sec. 4, was shown in the subsequent work of Cohen and Shashua 
(2016) to be superior to the depth efficiency of the popular convolutional rectifier networks, namely 
convolutional networks with rectified linear (ReLU) activation and max or average pooling. 

Employing machinery from measure theory and matrix algebra, made available through their 
connection to hierarchical tensor decompositions, we prove a number of fundamental results
concerning the depth efficiency of our convolutional arithmetic circuits. Our main theoretical result
(thm. 1 and corollary 2) states that besides a negligible (zero measure) set, all functions that can 
be realized by a deep network of polynomial size, require exponential size in order to be realized, 
or even approximated, by a shallow network. When translated to the viewpoint of tensor
decompositions, this implies that almost all tensors realized by Hierarchical Tucker (HT) decomposition
(Hackbusch and Kühn (2009)) cannot be efficiently realized by the classic CP (rank-1)
decomposition. To the best of our knowledge, this result is unknown to the tensor analysis community, in
which the advantage of HT over CP is typically demonstrated through specific examples of tensors
that can be efficiently realized by the former and not by the latter. Following our main result, we
present a generalization (thm. 3 and corollary 4) that compares networks of arbitrary depths,
showing that the amount of resources one has to pay in order to maintain representational power while
trimming down layers of a network grows double exponentially w.r.t. the number of layers cut off. 
We also characterize cases in which dropping a single layer bears an exponential price. 

The remainder of the paper is organized as follows. In sec. 2 we briefly review notations and 
mathematical background required in order to follow our work. This is followed by sec. 3, which 
presents our convolutional arithmetic circuits and establishes their equivalence with tensor
decompositions. Our theoretical analysis is covered in sec. 4. Finally, sec. 5 concludes. In order to keep
the manuscript at a reasonable length, we defer our detailed survey of related work to app. D,
covering works on the depth efficiency of boolean circuits, arithmetic circuits and neural networks, as
well as different applications of tensor analysis in the field of deep learning. 

2. Preliminaries 

We begin by establishing notational conventions that will be used throughout the paper. We denote
vectors using bold typeface, e.g. $\mathbf{v} \in \mathbb{R}^s$. The coordinates of such a vector are referenced with
regular typeface and a subscript, e.g. $v_i \in \mathbb{R}$. This is not to be confused with bold typeface and a
subscript, e.g. $\mathbf{v}_i \in \mathbb{R}^s$, which represents a vector that belongs to some sequence. Tensors
(multi-dimensional arrays) are denoted by the letters “A” and “B” in calligraphic typeface, e.g.
$\mathcal{A}, \mathcal{B} \in \mathbb{R}^{M_1 \times \cdots \times M_N}$. A specific entry in a tensor will be referenced with subscripts, e.g.
$\mathcal{A}_{d_1 \ldots d_N} \in \mathbb{R}$. Superscripts will be used to denote individual objects within a collection. For
example, $\mathbf{v}^{(i)}$ stands for vector $i$ and $\mathcal{A}^y$ stands for tensor $y$. In cases where the collection of interest
is indexed by multiple coordinates, we will have multiple superscripts referencing individual objects,
e.g. $\mathbf{a}^{l,j,\gamma}$ will stand for vector $(l, j, \gamma)$. As shorthand for the Cartesian product of the Euclidean
space $\mathbb{R}^s$ with itself $N$ times, we will use the notation $(\mathbb{R}^s)^N$. Finally, for a positive integer $k$ we
use the shorthand $[k]$ to denote the set $\{1, \ldots, k\}$.

We now turn to establish a baseline, i.e. to present basic definitions and results, in the broad and 
comprehensive field of tensor analysis. We list here only the essentials required in order to follow 
the paper, referring the interested reader to Hackbusch (2012) for a more complete introduction to 
the field.$^1$ The most straightforward way to view a tensor is simply as a multi-dimensional array:
$\mathcal{A}_{d_1 \ldots d_N} \in \mathbb{R}$ where $i \in [N]$, $d_i \in [M_i]$. The number of indexing entries in the array, which are
also called modes, is referred to as the order of the tensor. The term dimension stands for the number
of values an index can take in a particular mode. For example, the tensor $\mathcal{A}$ appearing above has
order $N$ and dimension $M_i$ in mode $i$, $i \in [N]$. The space of all possible configurations $\mathcal{A}$ can take
is called a tensor space and is denoted, quite naturally, by $\mathbb{R}^{M_1 \times \cdots \times M_N}$.

1. The definitions we give are concrete special cases of the more abstract algebraic definitions given in Hackbusch
(2012). We limit the discussion to these special cases since they suffice for our needs and are easier to grasp.

A central operator in tensor analysis is the tensor product, denoted $\otimes$. This operator takes
two tensors $\mathcal{A}$ and $\mathcal{B}$ of orders $P$ and $Q$ respectively, and returns a tensor $\mathcal{A} \otimes \mathcal{B}$ of order $P + Q$,
defined by: $(\mathcal{A} \otimes \mathcal{B})_{d_1 \ldots d_{P+Q}} = \mathcal{A}_{d_1 \ldots d_P} \cdot \mathcal{B}_{d_{P+1} \ldots d_{P+Q}}$. Notice that in the case $P = Q = 1$, the
tensor product reduces to an outer product between vectors. Specifically, $\mathbf{v} \otimes \mathbf{u}$ - the tensor product
between $\mathbf{u} \in \mathbb{R}^{M_2}$ and $\mathbf{v} \in \mathbb{R}^{M_1}$, is no other than the rank-1 matrix $\mathbf{v}\mathbf{u}^\top \in \mathbb{R}^{M_1 \times M_2}$. In this
context, we will often use the shorthand $\otimes_{i=1}^{N} \mathbf{v}^{(i)}$ to denote the joint tensor product
$\mathbf{v}^{(1)} \otimes \cdots \otimes \mathbf{v}^{(N)}$.

Tensors of the form $\otimes_{i=1}^{N} \mathbf{v}^{(i)}$ are called pure or elementary, and are regarded as having rank-1
(assuming $\mathbf{v}^{(i)} \neq 0 \;\forall i$). It is not difficult to see that any tensor can be expressed as a sum of rank-1
tensors:

$$\mathcal{A} = \sum_{z=1}^{Z} \mathbf{v}_z^{(1)} \otimes \cdots \otimes \mathbf{v}_z^{(N)}, \quad \mathbf{v}_z^{(i)} \in \mathbb{R}^{M_i} \qquad (1)$$

A representation as above is called a CANDECOMP/PARAFAC decomposition of $\mathcal{A}$, or in short,
a CP decomposition.$^2$ The CP-rank of $\mathcal{A}$ is defined as the minimum number of terms in a CP
decomposition, i.e. as the minimal $Z$ for which eq. 1 can hold. Notice that for a tensor of order 2,
i.e. a matrix, this definition of CP-rank coincides with that of standard matrix rank.
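
The tensor product and the CP decomposition of eq. 1 can be illustrated numerically. The following is a minimal sketch, assuming NumPy; the helper `cp_tensor` and all sizes are hypothetical choices made for illustration and are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tensor product of an order-1 and an order-2 tensor: the result has order 3
# and its entries are products of the entries of the two factors.
u = rng.standard_normal(2)          # order 1, dimension 2
B = rng.standard_normal((3, 4))     # order 2, dimensions 3 and 4
T = np.multiply.outer(u, B)         # order 1 + 2 = 3
assert T.shape == (2, 3, 4)
assert np.isclose(T[1, 2, 3], u[1] * B[2, 3])

# For P = Q = 1 the tensor product is the outer product v u^T.
v = rng.standard_normal(3)
assert np.allclose(np.multiply.outer(v, u), np.outer(v, u))


def cp_tensor(factors):
    """Sum of Z rank-1 tensors as in eq. 1; factors[i] has shape (Z, M_i)."""
    Z = factors[0].shape[0]
    total = 0.0
    for z in range(Z):
        term = factors[0][z]
        for f in factors[1:]:
            term = np.multiply.outer(term, f[z])
        total = total + term
    return total


# Order 2: a CP decomposition with Z = 2 generic terms yields a rank-2 matrix,
# in line with CP-rank coinciding with matrix rank for matrices.
factors = [rng.standard_normal((2, 5)), rng.standard_normal((2, 6))]
A = cp_tensor(factors)
cp_rank_of_matrix = np.linalg.matrix_rank(A)
```

For tensors of order higher than 2 there is no comparably simple routine for the CP-rank, which is why the sketch only checks the matrix case.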

A symmetric tensor is one that is invariant to permutations of its indices. Formally, a tensor
$\mathcal{A}$ of order $N$ which is symmetric will have equal dimension $M$ in all modes, and for every
permutation $\pi : [N] \to [N]$ and indices $d_1 \ldots d_N \in [M]$, the following equality will hold:
$\mathcal{A}_{d_{\pi(1)} \ldots d_{\pi(N)}} = \mathcal{A}_{d_1 \ldots d_N}$. Note that for a vector $\mathbf{v} \in \mathbb{R}^M$, the tensor
$\otimes_{i=1}^{N} \mathbf{v} \in \mathbb{R}^{M \times \cdots \times M}$ is
symmetric. Moreover, every symmetric tensor may be expressed as a linear combination of such
(symmetric rank-1) tensors: $\mathcal{A} = \sum_{z=1}^{Z} \lambda_z \cdot \mathbf{v}_z \otimes \cdots \otimes \mathbf{v}_z$. This is referred to as a symmetric CP
decomposition, and the symmetric CP-rank is the minimal $Z$ for which such a decomposition exists.
Since a symmetric CP decomposition is in particular a standard CP decomposition, the symmetric
CP-rank of a symmetric tensor is always greater than or equal to its standard CP-rank. Note that for the
case of symmetric matrices (order-2 tensors) the symmetric CP-rank and the original CP-rank are
always equal.

A repeating concept in this paper is that of measure zero. More broadly, our analysis is framed 
in measure theoretical terms. While an introduction to the field is beyond the scope of the paper (the 
interested reader is referred to Jones (2001)), it is possible to intuitively grasp the ideas that form the 
basis to our claims. When dealing with subsets of a Euclidean space, the standard and most natural 
measure in a sense is called the Lebesgue measure. This is the only measure we consider in our 
analysis. A set of (Lebesgue) measure zero can be thought of as having zero “volume” in the space
of interest. For example, the interval between (0, 0) and (1, 0) has zero measure as a subset of the
2D plane, but has positive measure as a subset of the 1D x-axis. An alternative way to view a zero
measure set S follows the property that if one draws a random point in space by some continuous 
distribution, the probability of that point hitting S is necessarily zero. A related term that will be 
used throughout the paper is almost everywhere, which refers to an entire space excluding, at most, 
a set of zero measure. 

3. Convolutional Arithmetic Circuits 

We consider the task of classifying an instance $X = (\mathbf{x}_1, \ldots, \mathbf{x}_N)$, $\mathbf{x}_i \in \mathbb{R}^s$, into one of the
categories $\mathcal{Y} := \{1, \ldots, Y\}$. Representing instances as collections of vectors is natural in many
applications. In the case of image processing for example, $X$ may correspond to an image, and
$\mathbf{x}_1 \ldots \mathbf{x}_N$ may correspond to vector arrangements of (possibly overlapping) patches around
pixels. As customary, classification is carried out through maximization of per-label score
functions $\{h_y\}_{y \in \mathcal{Y}}$, i.e. the predicted label for the instance $X$ will be the index $y \in \mathcal{Y}$ for which the
score value $h_y(X)$ is maximal. Our attention is thus directed to functions over the instance space
$\mathcal{X} := \{(\mathbf{x}_1, \ldots, \mathbf{x}_N) : \mathbf{x}_i \in \mathbb{R}^s\} = (\mathbb{R}^s)^N$. We define our hypotheses space through the following
representation of score functions:

$$h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{d_1 \ldots d_N = 1}^{M} \mathcal{A}^y_{d_1 \ldots d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i) \qquad (2)$$

$f_{\theta_1} \ldots f_{\theta_M} : \mathbb{R}^s \to \mathbb{R}$ are referred to as representation functions, selected from a parametric family
$\mathcal{F} = \{f_\theta : \mathbb{R}^s \to \mathbb{R}\}_{\theta \in \Theta}$. Natural choices for this family are wavelets, radial basis functions
(Gaussians), and affine functions followed by point-wise activation (neurons). The coefficient tensor
$\mathcal{A}^y$ has order $N$ and dimension $M$ in each mode. Its entries correspond to a basis of $M^N$ point-wise
product functions $\{(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i)\}_{d_1 \ldots d_N \in [M]}$. We will often consider fixed
linearly independent representation functions $f_{\theta_1} \ldots f_{\theta_M}$. In this case the point-wise product functions
are linearly independent as well (see app. C.1), and we have a one to one correspondence between
score functions and coefficient tensors. To keep the manuscript concise, we defer the derivation of
our hypotheses space (eq. 2) to app. C, noting here that it arises naturally from the notion of tensor
products between spaces.

Figure 1: CP model - convolutional arithmetic circuit implementing CP (rank-1) decomposition.
Layers: input $X$, representation, hidden layer (1x1 conv followed by pooling), dense (output), with
$rep(i,d) = f_{\theta_d}(\mathbf{x}_i)$, $conv(i,z) = \langle \mathbf{a}^{z,i}, rep(i,:) \rangle$, $pool(z) = \prod_{i=1}^{N} conv(i,z)$, $out(y) = \langle \mathbf{a}^y, pool(:) \rangle$.

2. CP decomposition is regarded as the classic and most basic tensor decomposition, dating back to the beginning of
the 20th century (see Kolda and Bader (2009) for a historic survey).

Our eventual aim is to realize score functions $h_y$ with a layered network architecture. As a
first step along this path, we notice that $h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ is fully determined by the activations of
the $M$ representation functions $f_{\theta_1} \ldots f_{\theta_M}$ on the $N$ input vectors $\mathbf{x}_1 \ldots \mathbf{x}_N$. In other words, given
$\{f_{\theta_d}(\mathbf{x}_i)\}_{d \in [M], i \in [N]}$, the score $h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ is independent of the input. It is thus natural to
consider the computation of these $M \cdot N$ numbers as the first layer of our networks. This layer, referred
to as the representation layer, may be conceived as a convolutional operator with $M$ channels, each
corresponding to a different function applied to all input vectors (see fig. 1).

Once we have constrained our score functions to have the structure depicted in eq. 2, learning
a classifier reduces to estimation of the parameters $\theta_1 \ldots \theta_M$, and the coefficient tensors $\mathcal{A}^1 \ldots \mathcal{A}^Y$.
The computational challenge is that the latter tensors are of order $N$ (and dimension $M$ in each
mode), having an exponential number of entries ($M^N$ each). In the next subsections we utilize
tensor decompositions (factorizations) to address this computational challenge, and show how they
are naturally realized by convolutional arithmetic circuits.
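
To make the cost of eq. 2 concrete, the score of a single class can be evaluated directly from an explicit coefficient tensor. The sketch below assumes NumPy and a hypothetical family of representation functions (affine maps followed by a rectifier, one of the "neuron" choices mentioned above); the names `representation` and `score_naive` and all sizes are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, M, s = 3, 2, 4                      # patches, representation functions, patch size

# Hypothetical representation functions f_{theta_d}(x) = max(0, w_d . x + b_d).
W = rng.standard_normal((M, s))
b = rng.standard_normal(M)


def representation(X):
    """Return rep[i, d] = f_{theta_d}(x_i) for an instance X of shape (N, s)."""
    return np.maximum(0.0, X @ W.T + b)


def score_naive(A, X):
    """Evaluate eq. 2 by summing over all M**N index tuples (d_1, ..., d_N)."""
    rep = representation(X)
    total = 0.0
    for d in itertools.product(range(M), repeat=N):
        total += A[d] * np.prod([rep[i, d[i]] for i in range(N)])
    return total


A = rng.standard_normal((M,) * N)      # coefficient tensor with M**N entries
X = rng.standard_normal((N, s))
h = score_naive(A, X)

# The same quantity written as a contraction of A with the N rows of rep.
rep = representation(X)
h_einsum = np.einsum('abc,a,b,c->', A, rep[0], rep[1], rep[2])
```

Both evaluations touch every entry of the tensor, so their cost grows as $M^N$; this is the computational challenge that the decompositions in the following subsections address.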
3.1. Shallow Network as a CP Decomposition of $\mathcal{A}^y$

The most straightforward way to factorize a tensor is through a CP (rank-1) decomposition (see
sec. 2). Consider a joint CP decomposition for the coefficient tensors $\{\mathcal{A}^y\}_{y \in \mathcal{Y}}$:

$$\mathcal{A}^y = \sum_{z=1}^{Z} a_z^y \cdot \mathbf{a}^{z,1} \otimes \cdots \otimes \mathbf{a}^{z,N} \qquad (3)$$

where $\mathbf{a}^y \in \mathbb{R}^Z$ for $y \in \mathcal{Y}$ ($a_z^y$ stands for entry $z$ of $\mathbf{a}^y$), and $\mathbf{a}^{z,i} \in \mathbb{R}^M$ for $i \in [N]$, $z \in [Z]$. The
decomposition is joint in the sense that the same vectors $\mathbf{a}^{z,i}$ are shared across all classes $y$. Clearly,
if we set $Z = M^N$ this model is universal, i.e. any tensors $\mathcal{A}^1 \ldots \mathcal{A}^Y$ may be represented.

Substituting our CP decomposition (eq. 3) into the expression for the score functions in eq. 2,
we obtain:

$$h_y(X) = \sum_{z=1}^{Z} a_z^y \prod_{i=1}^{N} \left( \sum_{d=1}^{M} a_d^{z,i} f_{\theta_d}(\mathbf{x}_i) \right)$$

From this we conclude that the network illustrated in fig. 1 implements a classifier (score functions) 
under the CP decomposition in eq. 3. We refer to this network as CP model. The network consists 
of a representation layer followed by a single hidden layer, which in turn is followed by the output. 
The hidden layer begins with a 1 x 1 conv operator, which is simply a 3D convolution with Z
channels and receptive field 1x1. The convolution may operate without coefficient sharing, i.e. the 
filters that generate feature maps by sliding across the previous layer may have different coefficients 
at different spatial locations. This is often referred to in the deep learning community as a locally- 
connected operator (see Taigman et al. (2014)). To obtain a standard convolutional operator, simply 
enforce coefficient sharing by constraining the vectors $\mathbf{a}^{z,i}$ in the CP decomposition (eq. 3) to be
equal to each other for different values of $i$ (this setting is discussed in sec. 3.3). Following the conv
operator, the hidden layer includes global product pooling. Feature maps generated by conv are
reduced to singletons through multiplication of their entries, creating a vector of dimension Z. This 
vector is then mapped into the Y network outputs through a final dense linear layer. 

To recap, CP model (fig. 1) is a shallow (single hidden layer) convolutional arithmetic circuit 
that realizes the CP decomposition (eq. 3). It is universal, i.e. it can realize any coefficient tensors 
with large enough size (Z). Unfortunately, since the CP-rank of a generic tensor is exponential in 
its order (see Hackbusch (2012)), the size required for CP model to be universal is exponential (Z 
exponential in N). 
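
The forward pass of CP model can be sketched in a few lines. The code below assumes NumPy, uses random numbers in place of the representation layer outputs, and cross-checks the network against eq. 2 with the tensor of eq. 3 built explicitly (feasible only for tiny $N$). The array names and sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, Z, Y = 4, 3, 5, 2

a_conv = rng.standard_normal((Z, N, M))   # a^{z,i} in R^M (no sharing across i)
a_out = rng.standard_normal((Y, Z))       # a^y in R^Z
rep = rng.standard_normal((N, M))         # rep[i, d] stands in for f_{theta_d}(x_i)


def cp_model(rep, a_conv, a_out):
    conv = np.einsum('zid,id->iz', a_conv, rep)   # conv(i, z) = <a^{z,i}, rep(i, :)>
    pool = conv.prod(axis=0)                      # pool(z) = prod_i conv(i, z)
    return a_out @ pool                           # out(y) = <a^y, pool(:)>


scores = cp_model(rep, a_conv, a_out)

# Cross-check: build A^y from eq. 3 and plug it into eq. 2.
A = np.einsum('yz,za,zb,zc,zd->yabcd', a_out,
              a_conv[:, 0], a_conv[:, 1], a_conv[:, 2], a_conv[:, 3])
scores_eq2 = np.einsum('yabcd,a,b,c,d->y', A, rep[0], rep[1], rep[2], rep[3])
```

The network never forms the $M^N$-entry tensor; its cost is linear in $N$, $M$, $Z$ and $Y$, with all of the potential blow-up moved into the number of channels $Z$.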

3.2. Deep Network as a Hierarchical Decomposition of $\mathcal{A}^y$

In this subsection we present a deep network that corresponds to the recently introduced
Hierarchical Tucker tensor decomposition (Hackbusch and Kühn (2009)), which we refer to in short as
HT decomposition. The network, dubbed HT model, is universal. Specifically, any set of tensors
$\mathcal{A}^y$ represented by CP model can be represented by HT model with only a polynomial penalty in
terms of resources. The advantage of HT model, as we show in sec. 4, is that in almost all cases
it generates tensors that require an exponential size in order to be realized, or even approximated,
by CP model. Put differently, if one draws the weights of HT model by some continuous
distribution, with probability one, the resulting tensors cannot be approximated by a polynomial CP model.
Informally, this implies that HT model is exponentially more expressive than CP model.
Figure 2: HT model - convolutional arithmetic circuit implementing hierarchical decomposition.
Layers: input $X$, representation, hidden layers 0 through $L-1$ with $L = \log_2 N$ (each a 1x1 conv followed by pooling), dense (output), with
$rep(i,d) = f_{\theta_d}(\mathbf{x}_i)$, $conv_0(j,\gamma) = \langle \mathbf{a}^{0,j,\gamma}, rep(j,:) \rangle$, $pool_0(j,\gamma) = \prod_{j' \in \{2j-1,2j\}} conv_0(j',\gamma)$,
$pool_{L-1}(\gamma) = \prod_{j' \in \{1,2\}} conv_{L-1}(j',\gamma)$, $out(y) = \langle \mathbf{a}^{L,y}, pool_{L-1}(:) \rangle$.


HT model is based on the hierarchical tensor decomposition in eq. 4, which is a special case
of the HT decomposition as presented in Hackbusch and Kühn (2009) (in the latter’s terminology,
we restrict the matrices to be diagonal). Our construction and theoretical results apply to the
general HT decomposition as well, with the specialization done merely to bring forth a network that
resembles current convolutional networks.$^3$

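$$\begin{aligned}
\phi^{1,j,\gamma} &= \sum_{\alpha=1}^{r_0} a_\alpha^{1,j,\gamma} \, \mathbf{a}^{0,2j-1,\alpha} \otimes \mathbf{a}^{0,2j,\alpha} \\
&\;\;\vdots \\
\phi^{l,j,\gamma} &= \sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma} \, \underbrace{\phi^{l-1,2j-1,\alpha}}_{\text{order } 2^{l-1}} \otimes \underbrace{\phi^{l-1,2j,\alpha}}_{\text{order } 2^{l-1}} \\
&\;\;\vdots \\
\phi^{L-1,j,\gamma} &= \sum_{\alpha=1}^{r_{L-2}} a_\alpha^{L-1,j,\gamma} \, \underbrace{\phi^{L-2,2j-1,\alpha}}_{\text{order } N/4} \otimes \underbrace{\phi^{L-2,2j,\alpha}}_{\text{order } N/4} \\
\mathcal{A}^y &= \sum_{\alpha=1}^{r_{L-1}} a_\alpha^{L,y} \, \underbrace{\phi^{L-1,1,\alpha}}_{\text{order } N/2} \otimes \underbrace{\phi^{L-1,2,\alpha}}_{\text{order } N/2}
\end{aligned} \qquad (4)$$

The decomposition in eq. 4 recursively constructs the coefficient tensors $\{\mathcal{A}^y\}_{y \in [Y]}$ by
assembling vectors $\{\mathbf{a}^{0,j,\gamma}\}_{j \in [N], \gamma \in [r_0]}$ into tensors $\{\phi^{l,j,\gamma}\}_{l \in [L-1], j \in [N/2^l], \gamma \in [r_l]}$ in an incremental fashion.
The index $l$ stands for the level in the decomposition, $j$ represents the “location” within level $l$, and
$\gamma$ corresponds to the individual tensor in level $l$ and location $j$. $r_l$ is referred to as level-$l$ rank,
and is defined to be the number of tensors in each location of level $l$ (we denote for completeness
$r_L := Y$). The tensor $\phi^{l,j,\gamma}$ has order $2^l$, and we assume for simplicity that $N$ - the order of $\mathcal{A}^y$,
is a power of 2 (this is merely a technical assumption also made in Hackbusch and Kühn (2009), it
does not limit the generality of our analysis).

The parameters of the decomposition are the final level weights $\{\mathbf{a}^{L,y} \in \mathbb{R}^{r_{L-1}}\}_{y \in [Y]}$, the
intermediate levels’ weights $\{\mathbf{a}^{l,j,\gamma} \in \mathbb{R}^{r_{l-1}}\}_{l \in [L-1], j \in [N/2^l], \gamma \in [r_l]}$, and the first level vectors
$\{\mathbf{a}^{0,j,\gamma} \in \mathbb{R}^M\}_{j \in [N], \gamma \in [r_0]}$. This totals at $N \cdot M \cdot r_0 + \sum_{l=1}^{L-1} \frac{N}{2^l} \cdot r_{l-1} \cdot r_l + Y \cdot r_{L-1}$ individual
parameters, and if we assume equal ranks $r := r_0 = \cdots = r_{L-1}$, the number of parameters becomes
$N \cdot M \cdot r + N \cdot r^2 + Y \cdot r$.

3. If we had not constrained these matrices to be diagonal, pooling operations would involve entries from different channels.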

The hierarchical decomposition (eq. 4) is universal, i.e. with large enough ranks $r_l$ it can
represent any tensors. Moreover, it is a super-set of the CP decomposition (eq. 3). That is to say,
all tensors representable by a CP decomposition having $Z$ components are also representable by a
hierarchical decomposition with ranks $r_0 = r_1 = \cdots = r_{L-1} = Z$.$^4$ Note that this comes with
a polynomial penalty - the number of parameters increases from $N \cdot M \cdot Z + Z \cdot Y$ in the CP
decomposition, to $N \cdot M \cdot Z + Z \cdot Y + N \cdot Z^2$ in the hierarchical decomposition. However, as we
show in sec. 4, the gain in expressive power is exponential.
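
The parameter counts quoted above are easy to tabulate. The sketch below uses plain Python; the function names and the example sizes are hypothetical. It sums the intermediate levels exactly, which gives $(N - 2) \cdot Z^2$ for equal ranks, so the $N \cdot Z^2$ term in the text acts as a slightly loose upper bound.

```python
def cp_params(N, M, Z, Y):
    """Parameters of the CP decomposition (eq. 3): N*M*Z + Z*Y."""
    return N * M * Z + Z * Y


def ht_params(N, M, ranks, Y):
    """Parameters of the hierarchical decomposition (eq. 4).

    ranks = [r_0, ..., r_{L-1}] with L = log2(N).
    """
    L = len(ranks)
    assert 2 ** L == N, "N must equal 2**L"
    total = N * M * ranks[0]                       # first level vectors
    for l in range(1, L):                          # intermediate levels
        total += (N // 2 ** l) * ranks[l - 1] * ranks[l]
    return total + Y * ranks[L - 1]                # final level weights


N, M, Z, Y = 16, 10, 8, 5
cp = cp_params(N, M, Z, Y)
ht = ht_params(N, M, [Z] * 4, Y)
```

With these example sizes the hierarchical decomposition needs fewer than twice as many parameters as the CP decomposition it contains, which is the polynomial penalty referred to above.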

Plugging the expression for $\mathcal{A}^y$ in our hierarchical decomposition (eq. 4) into the score function
$h_y$ given in eq. 2, we obtain the network displayed in fig. 2 - HT model. This network includes a
representation layer followed by $L = \log_2 N$ hidden layers which in turn are followed by the output.
As in the shallow CP model (fig. 1), the hidden layers consist of 1 x 1 conv operators followed by
product pooling. The difference is that instead of a single hidden layer collapsing the entire spatial
structure through global pooling, hidden layers now pool over size-2 windows, decimating feature
maps by a factor of two (no overlaps). After $L = \log_2 N$ such layers feature maps are reduced to
singletons, and we arrive at a 1D structure with $r_{L-1}$ nodes. This is then mapped into $Y$ network
outputs through a final dense linear layer. We note that the network’s size-2 pooling windows (and
the resulting number of hidden layers $L = \log_2 N$) correspond to the fact that our hierarchical
decomposition (eq. 4) is based on a full binary tree over modes, i.e. it combines (through tensor
product) two tensors at a time. We focus on this setting solely for simplicity of presentation, and
since it is the one presented in Hackbusch and Kühn (2009). Our analysis (sec. 4) could easily be
adapted to hierarchical decompositions based on other trees (taking tensor products between more
than two tensors at a time), and that would correspond to networks with different pooling window
sizes and resulting depths.

HT model (fig. 2) is conceptually divided into two parts. The first is the representation layer,
transforming input vectors $\mathbf{x}_1 \ldots \mathbf{x}_N$ into $N \cdot M$ real-valued scalars $\{f_{\theta_d}(\mathbf{x}_i)\}_{i \in [N], d \in [M]}$. The
second and main part of the network, which we view as an “inference” engine, is the convolutional
arithmetic circuit that takes the $N \cdot M$ measurements produced by the representation layer, and
accordingly computes $Y$ class scores at the output layer.

To recap, we have now a deep network (fig. 2), which we refer to as HT model, that computes
the score functions $h_y$ (eq. 2) with coefficient tensors $\mathcal{A}^y$ hierarchically decomposed as in eq. 4.
The network is universal in the sense that with enough channels $r_l$, any tensors may be represented.
Moreover, the model is a super-set of the shallow CP model presented in sec. 3.1. The question of
depth efficiency now naturally arises. In particular, we would like to know if there are functions that
may be represented by a polynomially sized deep HT model, yet require exponential size from the
shallow CP model. The answer, as described in sec. 4, is that almost all functions realizable by HT
model meet this property. In other words, the set of functions realizable by a polynomial CP model
has measure zero in the space of functions realizable by a given polynomial HT model.
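
For the smallest non-trivial case, $N = 4$ (so $L = 2$), the HT model forward pass can be sketched as follows. The code assumes NumPy, uses random numbers in place of the representation layer outputs, and checks the network against eq. 2 with the tensor of eq. 4 assembled explicitly. Array names, ranks and sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, Y = 4, 3, 2
r0, r1 = 2, 2                                # L = log2(N) = 2 hidden layers

a0 = rng.standard_normal((N, r0, M))         # a^{0,j,gamma} in R^M
a1 = rng.standard_normal((N // 2, r1, r0))   # a^{1,j,gamma} in R^{r_0}
aL = rng.standard_normal((Y, r1))            # a^{L,y} in R^{r_{L-1}}
rep = rng.standard_normal((N, M))            # stands in for f_{theta_d}(x_i)


def pool_pairs(maps):
    """Size-2 product pooling without overlaps: (J, r) -> (J/2, r)."""
    return maps[0::2] * maps[1::2]


def ht_model(rep):
    conv0 = np.einsum('jgd,jd->jg', a0, rep)      # hidden layer 0: 1x1 conv
    pool0 = pool_pairs(conv0)                     # shape (N/2, r0)
    conv1 = np.einsum('jga,ja->jg', a1, pool0)    # hidden layer 1: 1x1 conv
    pool1 = pool_pairs(conv1)[0]                  # shape (r1,)
    return aL @ pool1                             # dense output layer


scores = ht_model(rep)

# Cross-check against eq. 2 with A^y assembled from eq. 4.
phi = [np.einsum('ga,ab,ac->gbc', a1[j], a0[2 * j], a0[2 * j + 1]) for j in range(2)]
A = np.einsum('yg,gab,gcd->yabcd', aL, phi[0], phi[1])
scores_eq2 = np.einsum('yabcd,a,b,c,d->y', A, rep[0], rep[1], rep[2], rep[3])
```

Each hidden layer halves the number of spatial locations, so a general $N$ would repeat the conv and pooling steps $\log_2 N$ times.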

4. To see this, simply assign the first level vectors with CP’s basis vectors, the last level weights with CP’s
per-class weights, and the intermediate levels’ weights with indicator vectors.
3.3. Shared Coefficients for Convolution

The 1x1 conv operator in our networks (see fig. 1 and 2) implements a local linear transformation 
with coefficients generally being location-dependent. In the special case where coefficients do not 
depend on location, i.e. remain fixed across space, the local linear transformation becomes a
standard convolution. We refer to this setting as coefficient sharing. Sharing is a widely used structural
constraint, one of the pillars behind the successful convolutional network architecture. In the
context of image processing (prominent application of convolutional networks), sharing is motivated by
the observation that in natural images, the semantic content of a pattern often does not depend on its
location. In this subsection we explore the effect of sharing on the expressiveness of our networks,
or more specifically, on the coefficient tensors they can represent.

For CP model, coefficient sharing amounts to setting $\mathbf{a}^z := \mathbf{a}^{z,1} = \cdots = \mathbf{a}^{z,N}$ in the CP
decomposition (eq. 3), transforming the latter to a symmetric CP decomposition:

$$\mathcal{A}^y = \sum_{z=1}^{Z} a_z^y \cdot \underbrace{\mathbf{a}^z \otimes \cdots \otimes \mathbf{a}^z}_{N \text{ times}}$$

CP model with sharing is not universal (not all tensors $\mathcal{A}^y$ are representable, no matter how large $Z$
is allowed to be) - it can only represent symmetric tensors.

In the case of HT model, sharing amounts to applying the following constraints on the hierarchical
decomposition in eq. 4: $\mathbf{a}^{l,\gamma} := \mathbf{a}^{l,1,\gamma} = \cdots = \mathbf{a}^{l,N/2^l,\gamma}$ for every $l = 0 \ldots L-1$ and $\gamma = 1 \ldots r_l$.
Note that in this case universality is lost as well, but nonetheless generated tensors are not limited
to be symmetric, already demonstrating an expressive advantage of deep models over shallow ones.
In sec. 4 we take this further by showing that the shared HT model is exponentially more expressive
than CP model, even if the latter is not constrained by sharing.
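
The contrast between the two shared models can be checked numerically for $N = 4$. The sketch below assumes NumPy and random weights; `is_symmetric` is a hypothetical helper that tests invariance under swaps of adjacent modes, which suffices because such swaps generate all permutations.

```python
import numpy as np

rng = np.random.default_rng(3)
M, Z, r0, r1 = 3, 4, 2, 2              # N = 4 throughout

# Shared CP: A = sum_z w_z * a^z (x) a^z (x) a^z (x) a^z (symmetric CP).
a = rng.standard_normal((Z, M))
w = rng.standard_normal(Z)
A_cp = np.einsum('z,za,zb,zc,zd->abcd', w, a, a, a, a)

# Shared HT: the same weights are used at every location j of a level.
a0 = rng.standard_normal((r0, M))      # a^{0,gamma}
a1 = rng.standard_normal((r1, r0))     # a^{1,gamma}
aL = rng.standard_normal(r1)           # a^{L,y} for a single class y
phi = np.einsum('ga,ab,ac->gbc', a1, a0, a0)
A_ht = np.einsum('g,gab,gcd->abcd', aL, phi, phi)


def is_symmetric(T):
    """True if T is invariant to swapping any two adjacent modes."""
    return all(np.allclose(T, np.swapaxes(T, k, k + 1)) for k in range(T.ndim - 1))


cp_symmetric = is_symmetric(A_cp)
ht_symmetric = is_symmetric(A_ht)
```

The shared HT tensor is invariant to swapping modes 1 and 2, or modes 3 and 4, but for generic weights it changes when modes 2 and 3 are exchanged, so it is not symmetric.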

4. Theorems of Network Capacity 

The first contribution of this paper, presented in sec. 3, is the equivalence between deep learning 
architectures successfully employed in practice, and tensor decompositions. Namely, we showed 
that convolutional arithmetic circuits as in fig. 2, which are in fact SimNets that have demonstrated 
promising empirical performance (see app. E), may be formulated as hierarchical tensor decompo¬ 
sitions. As a second contribution, we make use of the established link between arithmetic circuits 
and tensor decompositions, combining theoretical tools from these two worlds, to prove results that 
are of interest to both deep learning and tensor analysis communities. This is the focus of the current 
section. 

The fundamental theoretical result proven in this paper is the following: 

Theorem 1 Let $\mathcal{A}^y$ be a tensor of order $N$ and dimension $M$ in each mode, generated by the
recursive formulas in eq. 4. Define $r := \min\{r_0, M\}$, and consider the space of all possible
configurations for the parameters of the composition - $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$. In this space, the generated
tensor $\mathcal{A}^y$ will have CP-rank of at least $r^{N/2}$ almost everywhere (w.r.t. Lebesgue measure). Put
differently, the configurations for which the CP-rank of $\mathcal{A}^y$ is less than $r^{N/2}$ form a set of measure
zero. The exact same result holds if we constrain the composition to be “shared”, i.e. set $\mathbf{a}^{l,j,\gamma} \equiv \mathbf{a}^{l,\gamma}$
and consider the space of $\{\mathbf{a}^{l,\gamma}\}_{l,\gamma}$ configurations.

From the perspective of deep learning, thm. 1 leads to the following corollary: 
Corollary 2 Given linearly independent representation functions $\{f_{\theta_d}\}_{d \in [M]}$, randomizing the
weights of HT model (sec. 3.2) by a continuous distribution induces score functions $h_y$ that with
probability one, cannot be approximated arbitrarily well (in $L^2$ sense) by a CP model (sec. 3.1)
with less than $\min\{r_0, M\}^{N/2}$ hidden channels. This result holds even if we constrain HT model
with weight sharing (sec. 3.3) while leaving CP model in its general form.

That is to say, besides a negligible set, all functions that can be realized by a polynomially sized 
HT model (with or without weight sharing), require exponential size in order to be realized, or even 
approximated, by CP model. Most of the previous works relating to depth efficiency (see app. D) 
merely show existence of functions that separate depths (i.e. that are efficiently realizable by a deep 
network yet require super-polynomial size from shallow networks). Corollary 2 on the other hand 
establishes depth efficiency for almost all functions that a deep network can implement. Equally
importantly, it applies to deep learning architectures that are being successfully employed in practice
(SimNets - see app. E).

Adopting the viewpoint of tensor analysis, thm. 1 states that besides a negligible set, all tensors
realized by HT (Hierarchical Tucker) decomposition cannot be represented by the classic CP (rank-1)
decomposition if the latter has less than an exponential number of terms.$^5$ To the best of our
knowledge, this result has never been proved in the tensor analysis community. In the original
paper introducing HT decomposition (Hackbusch and Kühn (2009)), as a motivating example, the
authors present a specific tensor that is efficiently realizable by HT decomposition while requiring
an exponential number of terms from CP decomposition. Our result strengthens this motivation
considerably, showing that it is not just one specific tensor that favors HT over CP, but rather, almost
all tensors realizable by HT exhibit this preference. Taking into account that any tensor realized by
CP can also be realized by HT with only a polynomial penalty in the number of parameters (see
sec. 3.2), this implies that in an asymptotic sense, HT decomposition is exponentially more efficient
than CP decomposition.

4.1. Proof Sketches 

The complete proofs of thm. 1 and corollary 2 are given in app. B. We provide here an outline of 
the main tools employed and arguments made along these proofs. 

To prove thm. 1 we combine approaches from the worlds of circuit complexity and tensor
decompositions. The first class of machinery we employ is matrix algebra, which has proven to be a
powerful source of tools for analyzing the complexity of circuits. For example, arithmetic circuits
have been analyzed through what is called the partial derivative matrix (see Raz and Yehudayoff
(2009)), and for boolean circuits a widely used tool is the communication matrix (see Karchmer
(1989)). We gain access to matrix algebra by arranging tensors that take part in the CP and HT
decompositions as matrices, a process often referred to as matricization. With matricization, the
tensor product translates to the Kronecker product, and the properties of the latter become readily
available. The second tool-set we make use of is measure theory, which prevails in the study of tensor
decompositions, but is much less frequent in analyses of circuit complexity. In order to frame
a problem in measure theoretical terms, one obviously needs to define a measure space of interest.
For tensor decompositions, the straightforward space to focus on is that of the decomposition
variables. For general circuits on the other hand, it is often unclear if defining a measure space is
at all appropriate. However, when circuits are considered in the context of machine learning they
are usually parameterized, and defining a measure space on top of these parameters is an effective
approach for studying the prevalence of various properties in hypotheses spaces.

5. As stated in sec. 3.2, the decomposition in eq. 4 to which thm. 1 applies is actually a special case of HT decomposition
as introduced in Hackbusch and Kühn (2009). However, the theorem and its proof can easily be adapted to account
for the general case. We focus on the special case merely because it corresponds to convolutional arithmetic circuit
architectures used in practice.

6. The same motivating example is given in a more recent textbook introducing tensor analysis (Hackbusch (2012)).

Our proof of thm. 1 traverses through the following path. We begin by showing that matricizing
a rank-1 tensor produces a rank-1 matrix. This implies that the matricization of a tensor generated by
a CP decomposition with $Z$ terms has rank at most $Z$. We then turn to show that the matricization
of a tensor generated by the HT decomposition in eq. 4 has rank at least $\min\{r_0, M\}^{N/2}$ almost
everywhere. This is done through induction over the levels of the decomposition ($l = 1 \ldots L$). For
the first level ($l = 1$), we use a combination of measure theoretical and linear algebraic arguments
to show that the generated matrices have maximal rank ($\min\{r_0, M\}$) almost everywhere. For the
induction step, the facts that under matricization tensor product translates into Kronecker product,
and that the latter increases ranks multiplicatively (see footnote 7), imply that matricization ranks in the current
level are generally equal to those in the previous level squared. Measure theoretical claims are then
made to ensure that this indeed takes place almost everywhere.

To prove corollary 2 based on thm. 1, we need to show that the inability of CP model to realize a
tensor generated by HT model, implies that the former cannot approximate score functions produced
by the latter. In general, the set of tensors expressible by a CP decomposition is not topologically
closed (see footnote 8), which implies that a-priori, it may be that CP model can approximate tensors generated
by HT model even though it cannot realize them. However, since the proof of thm. 1 was achieved
through separation of matrix rank, distances are indeed positive and CP model cannot approximate
HT model's tensors almost always. To translate from tensors to score functions, we simply note
that in a finite-dimensional Hilbert space convergence in norm implies convergence in coefficients
under any basis. Therefore, in the space of score functions (eq. 2) convergence in norm implies
convergence in coefficients under the basis $\{(x_1, \ldots, x_N) \mapsto \prod_{i=1}^{N} f_{\theta_{d_i}}(x_i)\}_{d_1 \ldots d_N \in [M]}$. That is to
say, it implies convergence in coefficient tensors.

4.2. Generalization 

Thm. 1 and corollary 2 compare the expressive power of the deep HT model (sec. 3.2) to that of 
the shallow CP model (sec. 3.1). One may argue that such an analysis is lacking, as it does not 
convey information regarding the importance of each individual layer. In particular, it does not shed 
light on the advantage of very deep networks, which at present provide state of the art recognition 
accuracy, compared to networks of more moderate depth. For this purpose we present a generalization,
specifying the amount of resources one has to pay in order to maintain representational power
while layers are incrementally cut off from a deep network. For conciseness we defer this analysis
to app. A, and merely state here our final conclusions. We find that the representational penalty is
double exponential w.r.t. the number of layers removed. In addition, there are certain cases where
the removal of even a single layer leads to an exponential inflation, falling in line with the suggestion
of Bengio (2009).

7. If $\odot$ denotes the Kronecker product, then for any matrices $A$ and $B$: $\mathrm{rank}(A \odot B) = \mathrm{rank}(A) \cdot \mathrm{rank}(B)$.

8. Hence the definition of border rank, see Hackbusch (2012).

5. Discussion

In this work we address a fundamental issue in deep learning - the expressive efficiency of depth. 
There have been many attempts to theoretically analyze this question, but from a practical machine 
learning perspective, existing results are limited. Most of the results apply to very specific types of
networks that do not resemble ones used in practice, and none of the results account for the
locality-sharing-pooling paradigm which forms the basis for convolutional networks - the most successful
deep learning architecture to date. In addition, current analyses merely show existence of depth
efficiency, i.e. of functions that are efficiently realizable by deep networks but not by shallow ones.
The practical implications of such findings are arguably slight, as a-priori, it may be that only a 
small fraction of the functions realizable by deep networks enjoy depth efficiency, and for all the 
rest shallow networks suffice. 

Our aim in this paper was to develop a theory that facilitates an analysis of depth efficiency for
networks that incorporate the widely used structural ingredients of locality, sharing and pooling.
We consider the task of classification into one of a finite set of categories $\mathcal{Y} = \{1, \ldots, Y\}$. Our
instance space is defined to be the Cartesian product of $N$ vector spaces, in compliance with the
common practice of representing natural data through ordered local structures (e.g. images through
patches). Each of the $N$ vectors that compose an instance is represented by a descriptor of length $M$,
generated by running the vector through $M$ "representation" functions. As customary, classification
is achieved through maximization of score functions $h_y$, one for every category $y \in \mathcal{Y}$. Each score
function is a linear combination over the $M^N$ possible products that may be formed by taking one
descriptor entry from every input vector. The coefficients for these linear combinations conveniently
reside in tensors $\mathcal{A}^y$ of order $N$ and dimension $M$ along each axis. We construct networks that
compute score functions $h_y$ by decomposing (factorizing) the coefficient tensors $\mathcal{A}^y$. The resulting
networks are convolutional arithmetic circuits that incorporate locality, sharing and pooling, and
operate on the $N \cdot M$ descriptor entries generated from the input.

We show that a shallow (single hidden layer) network realizes the classic CP (rank-1) tensor
decomposition, whereas a deep network with $\log_2 N$ hidden layers realizes the recently introduced
Hierarchical Tucker (HT) decomposition (Hackbusch and Kühn (2009)). Our fundamental result,
presented in thm. 1 and corollary 2, states that randomizing the weights of a deep network by some
continuous distribution will lead, with probability one, to score functions that cannot be approximated
by a shallow network if the latter's size is not exponential (in $N$). We extend this result
(thm. 3 and corollary 4) by deriving analogous claims that compare two networks of any depths, not
just deep vs. shallow.

To further highlight the connection between our networks and ones used in practice, we show 
(app. E) that translating convolution and product pooling computations to log-space (for numerical 
stability) gives rise to SimNets - a recently proposed deep learning architecture which has been 
shown to produce state of the art accuracy in computationally limited settings (Cohen et al. (2016)). 

Besides the central line of our work discussed above, the construction and theory presented in
this paper shed light on various conjectures and practices employed by the deep learning community.
First, with respect to the pooling operation, our analysis points to the possibility that perhaps it has
more to do with factorization of computed functions than it does with translation invariance. This
may serve as an explanation for the fact that pooling windows in state of the art convolutional
networks are typically very small (see for example Simonyan and Zisserman (2014)), often much
smaller than the radius of translation one would like to be invariant to. Indeed, in our framework, as
we show in app. A, pooling over large windows and trimming down a network's depth may bring about
an exponential decrease in expressive efficiency.

The second point our theory sheds light on is sharing. As discussed in sec. 3.3, introducing
weight sharing to a shallow network (CP model) considerably limits its expressive power. The network
can only represent symmetric tensors, which in turn means that it is location invariant w.r.t.
input vectors (patches). In the case of a deep network (HT model) the limitation posed by sharing is
not as strict. Generated tensors need not be symmetric, implying that the network is capable of modeling
location - a crucial ability in almost any real-world task. The above findings suggest that the
sharing constraint is increasingly limiting as a network gets shallower, to the point where it causes
complete ignorance to location. This could serve as an argument supporting the empirical success
of deep convolutional networks - they bind together the statistical and computational advantages of
sharing with many layers that mitigate its expressive limitations.

Lastly, our construction advocates locality, or more specifically, 1×1 receptive fields. Recent
convolutional networks providing state of the art recognition performance (e.g. Lin et al. (2014);
Szegedy et al. (2015)) make extensive use of 1×1 linear transformations, proving them to be
very successful in practice. In view of our model, such 1×1 operators factorize tensors while
providing universality with a minimal number of parameters. It seems reasonable to conjecture that
for this task of factorizing coefficient tensors, larger receptive fields are not significantly helpful,
as they lead to redundancy which may deteriorate performance in presence of limited training data.
Investigation of this conjecture is left for future work.

Appendix A. Generalized Theorem of Network Capacity

In sec. 4 we presented our fundamental theorem of network capacity (thm. 1 and corollary 2), showing that 
besides a negligible set, all functions that can be realized by a polynomially sized HT model (with or without 
weight sharing), require exponential size in order to be realized, or even approximated, by CP model. In 
terms of network depth, CP and HT models represent the extremes - the former has only a single hidden 
layer achieved through global pooling, whereas the latter has $L = \log_2 N$ hidden layers achieved through
minimal (size-2) pooling windows. It is of interest to generalize the fundamental result by establishing a 
comparison between networks of intermediate depths. This is the focus of the current appendix. 

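We begin by defining a truncated version of the hierarchical tensor decomposition presented in eq. 4:

$$
\begin{aligned}
\phi^{1,j,\gamma} &= \sum_{\alpha=1}^{r_0} a_\alpha^{1,j,\gamma} \, a^{0,2j-1,\alpha} \otimes a^{0,2j,\alpha} \\
&\;\;\vdots \\
\phi^{l,j,\gamma} &= \sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma} \, \underbrace{\phi^{l-1,2j-1,\alpha}}_{\text{order } 2^{l-1}} \otimes \underbrace{\phi^{l-1,2j,\alpha}}_{\text{order } 2^{l-1}} \\
&\;\;\vdots \\
\mathcal{A} &= \sum_{\alpha=1}^{r_{L_c-1}} a_\alpha^{L_c} \bigotimes_{j=1}^{2^{L-L_c+1}} \underbrace{\phi^{L_c-1,j,\alpha}}_{\text{order } 2^{L_c-1}}
\end{aligned}
\tag{5}
$$

The only difference between this decomposition and the original is that instead of completing the full process
with $L := \log_2 N$ levels, we stop after $L_c \le L$. At this point remaining tensors are bound together to form
the final order-$N$ tensor. The corresponding network will simply include a premature global pooling stage
that shrinks feature maps to 1×1, and then a final linear layer that performs classification. As before, we
consider a shared version of the decomposition in which $a^{l,j,\gamma} \equiv a^{l,\gamma}$. Notice that this construction realizes
a continuum between CP and HT models, which correspond to the extreme cases $L_c = 1$ and $L_c = L$
respectively.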

The following theorem, a generalization of thm. 1, compares a truncated decomposition having $L_1$ levels,
to one with $L_2 < L_1$ levels that implements the same tensor, quantifying the penalty in terms of parameters:

Theorem 3 Let $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$ be tensors of order $N$ and dimension $M$ in each mode, generated by
the truncated recursive formulas in eq. 5, with $L_1$ and $L_2$ levels respectively. Denote by $\{r_l^{(1)}\}_{l=0}^{L_1-1}$ and
$\{r_l^{(2)}\}_{l=0}^{L_2-1}$ the composition ranks of $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$ respectively. Assuming w.l.o.g. that $L_1 > L_2$, we define
$r := \min\{r_0^{(1)}, \ldots, r_{L_2-1}^{(1)}, M\}$, and consider the space of all possible configurations for the parameters of
$\mathcal{A}^{(1)}$'s composition - $\{a^{(1),l,j,\gamma}\}_{l,j,\gamma}$. In this space, almost everywhere (w.r.t. Lebesgue measure), the generated
tensor $\mathcal{A}^{(1)}$ requires that $r_{L_2-1}^{(2)} \ge (r)^{2^{L-L_2}}$ if one wishes that $\mathcal{A}^{(2)}$ be equal to $\mathcal{A}^{(1)}$. Put differently, the
configurations for which $\mathcal{A}^{(1)}$ can be realized by $\mathcal{A}^{(2)}$ with $r_{L_2-1}^{(2)} < (r)^{2^{L-L_2}}$ form a set of measure zero. The
exact same result holds if we constrain the composition of $\mathcal{A}^{(1)}$ to be "shared", i.e. set $a^{(1),l,j,\gamma} \equiv a^{(1),l,\gamma}$
and consider the space of $\{a^{(1),l,\gamma}\}_{l,\gamma}$ configurations.

In analogy with corollary 2, we obtain the following generalization: 

Corollary 4 Suppose we are given linearly independent representation functions $f_{\theta_1} \ldots f_{\theta_M}$, and consider
two networks that correspond to the truncated hierarchical tensor decomposition in eq. 5, with $L_1$ and $L_2$
hidden layers respectively. Assume w.l.o.g. that $L_1 > L_2$, i.e. that network 1 is deeper than network 2, and
define $r$ to be the minimal number of channels across the representation layer and the first $L_2$ hidden layers
of network 1. Then, if we randomize the weights of network 1 by a continuous distribution, we obtain, with
probability one, score functions $h_y$ that cannot be approximated arbitrarily well (in $L^2$ sense) by network 2
if the latter has less than $(r)^{2^{L-L_2}}$ channels in its last hidden layer. The result holds even if we constrain
network 1 with weight sharing while leaving network 2 in its general form.

Proofs of thm. 3 and corollary 4 are given in app. B. Hereafter, we briefly discuss some of their implications.
First, notice that we indeed obtain a generalization of the fundamental theorem of network capacity
(thm. 1 and corollary 2), which corresponds to the extreme case $L_1 = L$ and $L_2 = 1$. Second, note that for the
baseline case of $L_1 = L$, i.e. a full-depth network has generated the target score function, approximating this
with a truncated network draws a price that grows double exponentially w.r.t. the number of missing layers.
Third, and most intriguingly, we see that when $L_1$ is considerably smaller than $L$, i.e. when a significantly
truncated network is sufficient to model our problem, cutting off even a single layer leads to an exponential
price, and this price is independent of $L_1$. Such scenarios of exponential penalty for trimming down a single
layer were discussed in Bengio (2009), but only in the context of specific functions realized by networks that
do not resemble ones used in practice (see Håstad and Goldmann (1991) for an example of such result). We
prove this in a much broader, more practical setting, showing that for convolutional arithmetic circuit
(SimNet - see app. E) architectures, almost any function realized by a significantly truncated network will exhibit
this behavior. The issue relates to empirical practice, supporting the common methodology of designing networks
that go as deep as possible. Specifically, it encourages extending network depth by pooling over small
regions, avoiding significant spatial decimation that brings network termination closer.
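
To get a feel for the magnitudes involved, the bound can be evaluated for concrete sizes. The sketch below is illustrative only: it assumes the lower bound of corollary 4 takes the form $(r)^{2^{L-L_2}}$, and the helper name `min_channels` as well as the sizes used are hypothetical choices made for this example.

```python
def min_channels(r, L, L2):
    """Lower bound r ** (2 ** (L - L2)) on the width of the last hidden layer
    of the shallower network (assumed form of the bound in corollary 4).

    r  -- minimal channel count of the deeper network (hypothetical value)
    L  -- full depth, L = log2(N)
    L2 -- number of hidden layers of the shallower, truncated network
    """
    if not 1 <= L2 <= L:
        raise ValueError("expected 1 <= L2 <= L")
    return r ** (2 ** (L - L2))


# Hypothetical sizes: N = 64 input patches (so L = 6) and r = 10 channels.
for missing in range(1, 4):
    print(missing, min_channels(10, 6, 6 - missing))
```

Each additional missing layer squares the bound, which is the double exponential growth referred to above.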

We conclude this appendix by stressing once more that our construction and theoretical approach are 
not limited to the models covered by our theorems (CP model, HT model, truncated HT model). These are 
merely exemplars deemed most appropriate for initial analysis. The fundamental and generalized theorems 
of network capacity are similar in spirit, and analogous theorems for networks with different pooling window 
sizes and depths (corresponding to different tensor decompositions) may easily be derived. 


Appendix B. Proofs 

B.1. Proof of Theorems 1 and 3

Our proof of thm. 1 and 3 relies on basic knowledge in measure theory, or more specifically, Lebesgue
measure spaces. We do not provide here a comprehensive background on this field (the interested reader is
referred to Jones (2001)), but rather supplement the brief discussion given in sec. 2, with a list of facts we
will be using which are not necessarily intuitive:

• A union of countably (or finitely) many sets of zero measure is itself a set of zero measure.

• If $p$ is a polynomial over $d$ variables that is not identically zero, the set of points in $\mathbb{R}^d$ in which it
vanishes has zero measure (see Caron and Traynor (2005) for a short proof of this).

• If $S \subset \mathbb{R}^{d_1}$ has zero measure, then $S \times \mathbb{R}^{d_2} \subset \mathbb{R}^{d_1+d_2}$, and every set contained within, have zero
measure as well.

In the above, and in the entirety of this paper, the only measure spaces we consider are Euclidean spaces 
equipped with Lebesgue measure. Thus when we say that a set of d-dimensional points has zero measure, we 
mean that its Lebesgue measure in the d-dimensional Euclidean space is zero. 

Moving on to some preliminaries from matrix and tensor theory, we denote by $[\mathcal{A}]$ the matricization of
an order-$N$ tensor $\mathcal{A}$ (for simplicity, $N$ is assumed to be even), where rows correspond to odd modes and
columns correspond to even modes. Namely, if $\mathcal{A} \in \mathbb{R}^{M_1 \times \cdots \times M_N}$, the matrix $[\mathcal{A}]$ has $M_1 \cdot M_3 \cdot \ldots \cdot M_{N-1}$
rows and $M_2 \cdot M_4 \cdot \ldots \cdot M_N$ columns, rearranging the entries of the tensor such that $\mathcal{A}_{d_1 \ldots d_N}$ is stored in row
index $1 + \sum_{i=1}^{N/2} (d_{2i-1} - 1) \prod_{j=i+1}^{N/2} M_{2j-1}$ and column index $1 + \sum_{i=1}^{N/2} (d_{2i} - 1) \prod_{j=i+1}^{N/2} M_{2j}$. To distinguish
from the tensor product operation $\otimes$, we denote the Kronecker product between matrices by $\odot$. Specifically,
for two matrices $A \in \mathbb{R}^{M_1 \times M_2}$ and $B \in \mathbb{R}^{N_1 \times N_2}$, $A \odot B$ is the matrix in $\mathbb{R}^{M_1 N_1 \times M_2 N_2}$ that holds $A_{ij} B_{kl}$
in row index $(i - 1) N_1 + k$ and column index $(j - 1) N_2 + l$. The basic relation that binds together tensor
product, matricization and Kronecker product is $[\mathcal{A} \otimes \mathcal{B}] = [\mathcal{A}] \odot [\mathcal{B}]$, where $\mathcal{A}$ and $\mathcal{B}$ are tensors of even
orders. Two additional facts we will make use of are that the matricization is a linear operator (i.e. for scalars
$\alpha_1 \ldots \alpha_r$ and tensors with the same size $\mathcal{A}_1 \ldots \mathcal{A}_r$: $[\sum_{i=1}^{r} \alpha_i \mathcal{A}_i] = \sum_{i=1}^{r} \alpha_i [\mathcal{A}_i]$), and less trivially, that for
any matrices $A$ and $B$, the rank of $A \odot B$ is equal to $\mathrm{rank}(A) \cdot \mathrm{rank}(B)$ (see Bellman et al. (1970) for a
proof). These two facts, along with the basic relation laid out above, lead to the conclusion that:


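$$
\mathrm{rank} \left[ v_1^{(z)} \otimes \cdots \otimes v_{2^L}^{(z)} \right]
= \mathrm{rank} \left( \left[ v_1^{(z)} \otimes v_2^{(z)} \right] \odot \cdots \odot \left[ v_{2^L-1}^{(z)} \otimes v_{2^L}^{(z)} \right] \right)
= \prod_{i=1}^{2^L/2} \mathrm{rank} \left[ v_{2i-1}^{(z)} \otimes v_{2i}^{(z)} \right] = 1
$$

and thus:

$$
\mathrm{rank} \left[ \sum_{z=1}^{Z} \lambda_z \, v_1^{(z)} \otimes \cdots \otimes v_{2^L}^{(z)} \right]
= \mathrm{rank} \sum_{z=1}^{Z} \lambda_z \left[ v_1^{(z)} \otimes \cdots \otimes v_{2^L}^{(z)} \right]
\le \sum_{z=1}^{Z} \mathrm{rank} \left[ v_1^{(z)} \otimes \cdots \otimes v_{2^L}^{(z)} \right] = Z
$$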

In words, an order-$2^L$ tensor given by a CP-decomposition (see sec. 2) with $Z$ terms, has matricization with
rank at most $Z$. Thus, to prove that a certain order-$2^L$ tensor has CP-rank of at least $R$, it suffices to show
that its matricization has rank of at least $R$.
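
The matricization and its relation to the Kronecker product are easy to check numerically. The following is a minimal sketch assuming NumPy; the helper names `matricize` and `cp_tensor`, the random seed and the sizes are hypothetical choices made for illustration and are not part of the original construction.

```python
import numpy as np


def matricize(t):
    """Arrange an even-order tensor as a matrix: odd modes index rows, even modes index columns."""
    n = t.ndim
    assert n % 2 == 0, "order must be even"
    odd = list(range(0, n, 2))    # modes 1, 3, ... in 1-based numbering
    even = list(range(1, n, 2))   # modes 2, 4, ...
    rows = int(np.prod([t.shape[i] for i in odd]))
    cols = int(np.prod([t.shape[i] for i in even]))
    # C-ordered reshape reproduces the row and column index formulas given above (0-based).
    return np.transpose(t, odd + even).reshape(rows, cols)


def cp_tensor(weights, factors):
    """Sum over z of weights[z] * (v_1 (x) ... (x) v_N), where factors[z] lists the N vectors."""
    total = None
    for lam, vecs in zip(weights, factors):
        term = vecs[0]
        for v in vecs[1:]:
            term = np.multiply.outer(term, v)
        total = lam * term if total is None else total + lam * term
    return total


rng = np.random.default_rng(0)
M, N, Z = 3, 4, 2
factors = [[rng.standard_normal(M) for _ in range(N)] for _ in range(Z)]
weights = rng.standard_normal(Z)

# A CP tensor with Z terms has matricization rank at most Z.
cp_matricization_rank = int(np.linalg.matrix_rank(matricize(cp_tensor(weights, factors))))

# [A (x) B] = [A] (.) [B] for tensors of even orders.
A = rng.standard_normal((M, M))
B = rng.standard_normal((M, M, M, M))
kron_identity_holds = np.allclose(
    matricize(np.multiply.outer(A, B)), np.kron(matricize(A), matricize(B))
)
```

Here the matricization is a 9×9 matrix, yet its rank is bounded by the number of CP terms rather than by its size.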

We now state and prove two lemmas that will be needed for our proofs of thm. 1 and 3. 


Lemma 5 Let $M, N \in \mathbb{N}$, and define the following mapping taking $x \in \mathbb{R}^{2MN+N}$ to three matrices:
$A(x) \in \mathbb{R}^{M \times N}$, $B(x) \in \mathbb{R}^{M \times N}$ and $D(x) \in \mathbb{R}^{N \times N}$. $A(x)$ simply holds the first $MN$ elements of $x$, $B(x)$
holds the following $MN$ elements of $x$, and $D(x)$ is a diagonal matrix that holds the last $N$ elements of $x$ on
its diagonal. Define the product matrix $U(x) := A(x) D(x) B(x)^\top \in \mathbb{R}^{M \times M}$, and consider the set of points
$x$ for which the rank of $U(x)$ is different from $r := \min\{M, N\}$. This set of points has zero measure. The
result will also hold if the points $x$ reside in $\mathbb{R}^{MN+N}$, and the same elements are used to assign $A(x)$ and
$B(x)$ ($A(x) \equiv B(x)$).


Proof Obviously $\mathrm{rank}(U(x)) \le r$ for all $x$, so it remains to show that $\mathrm{rank}(U(x)) \ge r$ for all $x$ but a
set of zero measure. Let $U_r(x)$ be the top-left $r \times r$ sub-matrix of $U(x)$. If $U_r(x)$ is non-singular then of
course $\mathrm{rank}(U(x)) \ge r$ as required. It thus suffices to show that the set of points $x$ for which $\det U_r(x) = 0$
has zero measure. Now, $\det U_r(x)$ is a polynomial in the entries of $x$, and so it either vanishes on a set of
zero measure, or it is the zero polynomial (see Caron and Traynor (2005)). All that is left is to disqualify
the latter option, and that can be done by finding a specific point $x_0$ for which $\det U_r(x_0) \ne 0$. Indeed,
we may choose $x_0$ such that $D(x_0)$ is the identity matrix and $A(x_0)$, $B(x_0)$ hold 1 on their main diagonal
and 0 otherwise. This selection implies that $U_r(x_0)$ is the identity matrix, and in particular $\det U_r(x_0) \ne 0$. ■
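
As a sanity check, the claim of lemma 5 can be probed numerically. The sketch below assumes NumPy and standard normal draws as the continuous distribution; the function names `product_rank` and `witness_block` are hypothetical. It is an illustration under these assumptions, not a substitute for the proof.

```python
import numpy as np


def product_rank(M, N, rng, shared=False):
    """Rank of U = A D B^T for randomly drawn A, B (M x N) and diagonal D (N x N)."""
    A = rng.standard_normal((M, N))
    B = A if shared else rng.standard_normal((M, N))
    D = np.diag(rng.standard_normal(N))
    return int(np.linalg.matrix_rank(A @ D @ B.T))


def witness_block(M, N):
    """Top-left r x r block of U at the point x_0 from the proof (D = I, ones on the main diagonals)."""
    r = min(M, N)
    A = np.eye(M, N)          # ones on the main diagonal, zeros elsewhere
    U = A @ np.eye(N) @ A.T   # B(x_0) coincides with A(x_0)
    return U[:r, :r]


rng = np.random.default_rng(0)
for M, N in [(3, 5), (5, 3), (4, 4)]:
    print(M, N, product_rank(M, N, rng), product_rank(M, N, rng, shared=True))
```

Random draws land outside the measure-zero exceptional set with probability one, so the printed ranks are expected to equal $\min\{M, N\}$ in both the unshared and shared settings.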


Lemma 6 Assume we have $p$ continuous mappings from $\mathbb{R}^d$ to $\mathbb{R}^{M \times N}$ taking the point $y$ to the matrices
$A_1(y) \ldots A_p(y)$. Assume that under these mappings, the points $y$ for which every $i \in [p]$ satisfies
$\mathrm{rank}(A_i(y)) < r$ form a set of zero measure. Define a mapping from $\mathbb{R}^p \times \mathbb{R}^d$ to $\mathbb{R}^{M \times N}$ given by
$(x, y) \mapsto A(x, y) := \sum_{i=1}^{p} x_i \cdot A_i(y)$. Then, the points $(x, y)$ for which $\mathrm{rank}(A(x, y)) < r$ form a
set of zero measure.

Proof Denote $S := \{(x, y) : \mathrm{rank}(A(x, y)) < r\} \subset \mathbb{R}^p \times \mathbb{R}^d$. We would like to show that this set has zero
measure. We first note that since $A(x, y)$ is a continuous mapping, and the set of matrices $A \in \mathbb{R}^{M \times N}$ which
have rank less than $r$ is closed, $S$ is a closed set and in particular measurable. Our strategy for computing its
measure will be as follows. For every $y \in \mathbb{R}^d$ we define the marginal set $S^y := \{x : \mathrm{rank}(A(x, y)) < r\} \subset \mathbb{R}^p$.
We will show that for every $y$ but a set of zero measure, the measure of $S^y$ is zero. An application of
Fubini's theorem will then prove the desired result.

Let $C$ be the set of points $y \in \mathbb{R}^d$ for which $\forall i \in [p] : \mathrm{rank}(A_i(y)) < r$. By assumption, $C$ has zero
measure. We now show that for $y_0 \in \mathbb{R}^d \setminus C$, the measure of $S^{y_0}$ is zero. By the definition of $C$ there exists
an $i \in [p]$ such that $\mathrm{rank}(A_i(y_0)) \ge r$. W.l.o.g., we assume that $i = 1$, and that the top-left $r \times r$ sub-matrix
of $A_1(y_0)$ is non-singular. Regarding $y_0$ as fixed, the determinant of the top-left $r \times r$ sub-matrix of $A(x, y_0)$
is a polynomial in the elements of $x$. It is not the zero polynomial, as setting $x_1 = 1, x_2 = \cdots = x_p = 0$
yields $A(x, y_0) = A_1(y_0)$, and the determinant of the latter's top-left $r \times r$ sub-matrix is non-zero. As a
non-zero polynomial, the determinant of the top-left $r \times r$ sub-matrix of $A(x, y_0)$ vanishes only on a set of
zero measure (Caron and Traynor (2005)). This implies that indeed the measure of $S^{y_0}$ is zero.

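We introduce a few notations towards our application of Fubini's theorem. First, the symbol $\mathbf{1}$ will be
used to represent indicator functions, e.g. $\mathbf{1}_S$ is the function from $\mathbb{R}^p \times \mathbb{R}^d$ to $\mathbb{R}$ that receives 1 on $S$ and 0
elsewhere. Second, we use a subscript of $n \in \mathbb{N}$ to indicate that the corresponding set is intersected with the
hyper-rectangle of radius $n$. For example, $S_n$ stands for the intersection between $S$ and $[-n, n]^{p+d}$, and $\mathbb{R}^d_n$
stands for the intersection between $\mathbb{R}^d$ and $[-n, n]^d$ (which is equal to the latter). All the sets we consider
are measurable, and those with subscript $n$ have finite measure. We may thus apply Fubini's theorem to get:

$$
\int \mathbf{1}_{S_n}
= \int_{(x,y) \in \mathbb{R}^{p+d}_n} \mathbf{1}_{S}
= \int_{y \in \mathbb{R}^d_n} \int_{x \in \mathbb{R}^p_n} \mathbf{1}_{S^y}
= \int_{y \in \mathbb{R}^d_n \cap C} \int_{x \in \mathbb{R}^p_n} \mathbf{1}_{S^y}
+ \int_{y \in \mathbb{R}^d_n \setminus C} \int_{x \in \mathbb{R}^p_n} \mathbf{1}_{S^y}
$$

Recall that the set $C \subset \mathbb{R}^d$ has zero measure, and for every $y \notin C$ the measure of $S^y \subset \mathbb{R}^p$ is zero. This
implies that both integrals in the last expression vanish, and thus $\int \mathbf{1}_{S_n} = 0$. Finally, we use the monotone
convergence theorem to compute $\int \mathbf{1}_S$:

$$
\int \mathbf{1}_{S} = \int \lim_{n \to \infty} \mathbf{1}_{S_n} = \lim_{n \to \infty} \int \mathbf{1}_{S_n} = \lim_{n \to \infty} 0 = 0
$$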


This shows that indeed our set of interest S has zero measure. ■ 
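
For intuition, the statement of lemma 6 can be illustrated at one fixed point $y$: as long as a single matrix in the collection has rank at least $r$, a random linear combination of the matrices keeps rank at least $r$. The sketch below assumes NumPy; the particular matrices, sizes and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
r, size = 3, 4

# Stand-ins for A_1(y), A_2(y), A_3(y) at a fixed y: only the first reaches rank r.
A1 = rng.standard_normal((size, r)) @ rng.standard_normal((r, size))  # rank 3 generically
A2 = np.outer(rng.standard_normal(size), rng.standard_normal(size))   # rank 1
A3 = np.zeros((size, size))                                           # rank 0

# Random coefficients x drawn from a continuous distribution.
x = rng.standard_normal(3)
combo = x[0] * A1 + x[1] * A2 + x[2] * A3

rank_A1 = int(np.linalg.matrix_rank(A1))
combo_rank = int(np.linalg.matrix_rank(combo))
```

The low-rank matrices cannot pull the rank of the combination below $r$ except on a set of coefficients $x$ of measure zero.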


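With all preliminaries and lemmas in place, we turn to prove thm. 1, establishing an exponential efficiency
of HT decomposition (eq. 4) over CP decomposition (eq. 3).

Proof [of theorem 1] We begin with the case of an "unshared" composition, i.e. the one given in eq. 4 (as
opposed to the "shared" setting of $a^{l,j,\gamma} \equiv a^{l,\gamma}$). Denoting for convenience $\phi^{L,1,1} := \mathcal{A}^y$ and $r_L = 1$, we
will show by induction over $l = 1, \ldots, L$ that almost everywhere (at all points but a set of zero measure)
w.r.t. $\{a^{l,j,\gamma}\}_{l,j,\gamma}$, all CP-ranks of the tensors $\{\phi^{l,j,\gamma}\}_{j \in [N/2^l], \gamma \in [r_l]}$ are at least $r^{2^l/2}$. In accordance with our
discussion in the beginning of this subsection, it suffices to consider the matricizations $[\phi^{l,j,\gamma}]$ and show that
these all have ranks greater or equal to $r^{2^l/2}$ almost everywhere.

For the case $l = 1$ we have:

$$
\phi^{1,j,\gamma} = \sum_{\alpha=1}^{r_0} a_\alpha^{1,j,\gamma} \, a^{0,2j-1,\alpha} \otimes a^{0,2j,\alpha}
$$

Denote by $A \in \mathbb{R}^{M \times r_0}$ the matrix with columns $\{a^{0,2j-1,\alpha}\}_{\alpha=1}^{r_0}$, by $B \in \mathbb{R}^{M \times r_0}$ the matrix with columns
$\{a^{0,2j,\alpha}\}_{\alpha=1}^{r_0}$, and by $D \in \mathbb{R}^{r_0 \times r_0}$ the diagonal matrix with $a^{1,j,\gamma}$ on its diagonal. Then, we may write
$[\phi^{1,j,\gamma}] = A D B^\top$, and according to lemma 5 the rank of $[\phi^{1,j,\gamma}]$ equals $r := \min\{r_0, M\}$ almost everywhere
w.r.t. $(\{a^{0,2j-1,\alpha}\}_\alpha, \{a^{0,2j,\alpha}\}_\alpha, a^{1,j,\gamma})$. To see that this holds almost everywhere w.r.t. $\{a^{l,j,\gamma}\}_{l,j,\gamma}$, one
should merely recall that for any dimensions $d_1, d_2 \in \mathbb{N}$, if the set $S \subset \mathbb{R}^{d_1}$ has zero measure, so does
any subset of $S \times \mathbb{R}^{d_2} \subset \mathbb{R}^{d_1+d_2}$. A finite union of zero measure sets has zero measure, thus the fact that
$\mathrm{rank}[\phi^{1,j,\gamma}] = r$ holds almost everywhere individually for any $j \in [N/2]$ and $\gamma \in [r_1]$, implies that it holds
almost everywhere jointly for all $j$ and $\gamma$. This proves our inductive hypothesis (unshared case) for $l = 1$.

Assume now that almost everywhere $\mathrm{rank}[\phi^{l-1,j',\gamma'}] \ge r^{2^{l-1}/2}$ for all $j' \in [N/2^{l-1}]$ and $\gamma' \in [r_{l-1}]$. For
some specific choice of $j \in [N/2^l]$ and $\gamma \in [r_l]$ we have:

$$
\phi^{l,j,\gamma} = \sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma} \, \phi^{l-1,2j-1,\alpha} \otimes \phi^{l-1,2j,\alpha}
\;\Longrightarrow\;
[\phi^{l,j,\gamma}] = \sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma} \, [\phi^{l-1,2j-1,\alpha}] \odot [\phi^{l-1,2j,\alpha}]
$$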



Denote $M_\alpha := [\phi^{l-1,2j-1,\alpha}] \odot [\phi^{l-1,2j,\alpha}]$ for $\alpha = 1 \ldots r_{l-1}$. By our inductive assumption, and by the general
property $\mathrm{rank}(A \odot B) = \mathrm{rank}(A) \cdot \mathrm{rank}(B)$, we have that almost everywhere the ranks of all matrices $M_\alpha$
are at least $r^{2^{l-1}/2} \cdot r^{2^{l-1}/2} = r^{2^l/2}$. Writing $[\phi^{l,j,\gamma}] = \sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma} \cdot M_\alpha$, and noticing that $\{M_\alpha\}$ do not
depend on $a^{l,j,\gamma}$, we turn our attention to lemma 6. The lemma tells us that $\mathrm{rank}[\phi^{l,j,\gamma}] \ge r^{2^l/2}$ almost
everywhere. Since a finite union of zero measure sets has zero measure, we conclude that almost everywhere
$\mathrm{rank}[\phi^{l,j,\gamma}] \ge r^{2^l/2}$ holds jointly for all $j \in [N/2^l]$ and $\gamma \in [r_l]$. This completes the proof of the theorem in
the unshared case.

Proving the theorem in the shared case may be done in the exact same way, except that for $l = 1$ one
needs the version of lemma 5 for which $A(x)$ and $B(x)$ are equal. ■
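
The rank separation can be observed on a toy instance. The sketch below assumes NumPy and an order-4 tensor ($N = 4$, so $L = 2$) built from the two-level recursion with standard normal weights; the array layout, the function names and the sizes are hypothetical choices made for illustration.

```python
import numpy as np


def matricize(t):
    """Odd modes index rows, even modes index columns (order must be even)."""
    n = t.ndim
    odd, even = list(range(0, n, 2)), list(range(1, n, 2))
    rows = int(np.prod([t.shape[i] for i in odd]))
    cols = int(np.prod([t.shape[i] for i in even]))
    return np.transpose(t, odd + even).reshape(rows, cols)


def random_ht_tensor(M, r0, r1, rng):
    """Order-4 tensor generated by a two-level hierarchical decomposition with random weights."""
    a0 = rng.standard_normal((4, r0, M))   # a0[j, alpha]: vector attached to input position j
    a1 = rng.standard_normal((2, r1, r0))  # a1[j, gamma]: level-1 weight vector
    a2 = rng.standard_normal(r1)           # top-level weights
    phi = np.zeros((2, r1, M, M))
    for j in range(2):
        for g in range(r1):
            for al in range(r0):
                phi[j, g] += a1[j, g, al] * np.outer(a0[2 * j, al], a0[2 * j + 1, al])
    A = np.zeros((M, M, M, M))
    for al in range(r1):
        A += a2[al] * np.multiply.outer(phi[0, al], phi[1, al])
    return A


rng = np.random.default_rng(0)
M, r0 = 3, 2
r = min(r0, M)
rank_r1_one = int(np.linalg.matrix_rank(matricize(random_ht_tensor(M, r0, 1, rng))))
rank_r1_two = int(np.linalg.matrix_rank(matricize(random_ht_tensor(M, r0, 2, rng))))
```

With a single top-level term the matricization is a Kronecker product of two rank-$r$ matrices, so its rank is exactly $r^2 = r^{N/2}$; adding top-level terms cannot push the rank below this value for generic weights. A CP decomposition of either tensor therefore needs at least $r^{N/2}$ terms.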


We now head on to prove thm. 3, which is a generalization of thm. 1. The proof will be similar in nature
to that of thm. 1, yet slightly more technical. In short, the idea is to show that in the generic case, expressing
$\mathcal{A}^{(1)}$ as a sum of tensor products between tensors of order $2^{L_2-1}$ requires at least $r^{2^{L-L_2}}$ terms. Since $\mathcal{A}^{(2)}$ is
expressed as a sum of $r^{(2)}_{L_2-1}$ such terms, demanding $\mathcal{A}^{(2)} = \mathcal{A}^{(1)}$ implies $r^{(2)}_{L_2-1} \ge r^{2^{L-L_2}}$.

To gain technical advantage and utilize known results from matrix theory (as we did when proving
thm. 1), we introduce a new tensor "squeezing" operator $\varphi$. For $q \in \mathbb{N}$, $\varphi_q$ is an operator that receives a
tensor with order divisible by $q$, and returns the tensor obtained by merging together the latter's modes in
groups of size $q$. Specifically, when applied to the tensor $\mathcal{A} \in \mathbb{R}^{M_1 \times \cdots \times M_{c \cdot q}}$ ($c \in \mathbb{N}$), $\varphi_q$ returns a tensor
of order $c$ which holds $\mathcal{A}_{d_1 \ldots d_{c \cdot q}}$ in the location defined by the following index for every mode $t \in [c]$:
$1 + \sum_{i=1}^{q} (d_{i+q(t-1)} - 1) \prod_{j=i+1}^{q} M_{j+q(t-1)}$. Notice that when applied to a tensor of order $q$, $\varphi_q$ returns a
vector. Also note that if $\mathcal{A}$ and $\mathcal{B}$ are tensors with orders divisible by $q$, and $\lambda$ is a scalar, we have the desirable
properties:

• $\varphi_q(\mathcal{A} \otimes \mathcal{B}) = \varphi_q(\mathcal{A}) \otimes \varphi_q(\mathcal{B})$

• $\varphi_q(\lambda \mathcal{A} + \mathcal{B}) = \lambda \varphi_q(\mathcal{A}) + \varphi_q(\mathcal{B})$
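
In array terms, the squeezing operator amounts to a row-major reshape that merges consecutive modes. The following minimal sketch assumes NumPy; the function name `squeeze_modes` and the test shapes are hypothetical. It checks the two properties listed above on random tensors.

```python
import numpy as np


def squeeze_modes(t, q):
    """Merge the modes of t in consecutive groups of size q (the order must be divisible by q)."""
    assert t.ndim % q == 0, "order must be divisible by q"
    c = t.ndim // q
    new_shape = [int(np.prod(t.shape[k * q:(k + 1) * q])) for k in range(c)]
    # A C-ordered reshape places entry (d_1, ..., d_q) of each group at the
    # 0-based version of the index formula given above.
    return t.reshape(new_shape)


rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 2, 3))
B = rng.standard_normal((3, 2, 3, 2))
C = rng.standard_normal((2, 3, 2, 3))
lam = 0.5

product_property = np.allclose(
    squeeze_modes(np.multiply.outer(A, B), 2),
    np.multiply.outer(squeeze_modes(A, 2), squeeze_modes(B, 2)),
)
linearity_property = np.allclose(
    squeeze_modes(lam * A + C, 2),
    lam * squeeze_modes(A, 2) + squeeze_modes(C, 2),
)
```

Because the orders of both factors are divisible by $q$, no merged group straddles the boundary between the two tensors, which is why squeezing commutes with the tensor product.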

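For the sake of our proof we are interested in the case $q = 2^{L_2-1}$, and denote for brevity $\varphi := \varphi_{2^{L_2-1}}$.

As stated above, we would like to show that in the generic case, expressing $\mathcal{A}^{(1)}$ as $\sum_{z=1}^{Z} \phi_1^{(z)} \otimes \cdots \otimes \phi_{2^{L-L_2+1}}^{(z)}$,
where $\phi_i^{(z)}$ are tensors of order $2^{L_2-1}$, implies $Z \ge r^{2^{L-L_2}}$. Applying $\varphi$ to both sides of such a
decomposition gives: $\varphi(\mathcal{A}^{(1)}) = \sum_{z=1}^{Z} \varphi(\phi_1^{(z)}) \otimes \cdots \otimes \varphi(\phi_{2^{L-L_2+1}}^{(z)})$, where $\varphi(\phi_i^{(z)})$ are now vectors. Thus,
to prove thm. 3 it suffices to show that in the generic case, the CP-rank of $\varphi(\mathcal{A}^{(1)})$ is at least $r^{2^{L-L_2}}$, or
alternatively, that the rank of the matricization $[\varphi(\mathcal{A}^{(1)})]$ is at least $r^{2^{L-L_2}}$. This will be our strategy in the
following proof: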

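Proof [of theorem 3] In accordance with the above discussion, it suffices to show that in the generic case
$\mathrm{rank}[\varphi(\mathcal{A}^{(1)})] \ge r^{2^{L-L_2}}$. To simplify the presentation, we reformulate the problem using slightly simpler
notations. We have an order-$N$ tensor $\mathcal{A}$ with dimension $M$ in each mode, generated as follows:

$$
\begin{aligned}
\phi^{1,j,\gamma} &= \sum_{\alpha=1}^{r_0} a_\alpha^{1,j,\gamma} \, a^{0,2j-1,\alpha} \otimes a^{0,2j,\alpha} \\
&\;\;\vdots \\
\phi^{l,j,\gamma} &= \sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma} \, \underbrace{\phi^{l-1,2j-1,\alpha}}_{\text{order } 2^{l-1}} \otimes \underbrace{\phi^{l-1,2j,\alpha}}_{\text{order } 2^{l-1}} \\
&\;\;\vdots \\
\mathcal{A} &= \sum_{\alpha=1}^{r_{L_1-1}} a_\alpha^{L_1} \bigotimes_{j=1}^{2^{L-L_1+1}} \underbrace{\phi^{L_1-1,j,\alpha}}_{\text{order } 2^{L_1-1}}
\end{aligned}
$$

where:

• $L_1 \le L := \log_2 N$

• $r_0, \ldots, r_{L_1-1} \in \mathbb{N}_{>0}$

• $a^{0,j,\alpha} \in \mathbb{R}^M$ for $j \in [N]$ and $\alpha \in [r_0]$

• $a^{l,j,\gamma} \in \mathbb{R}^{r_{l-1}}$ for $l \in [L_1 - 1]$, $j \in [N/2^l]$ and $\gamma \in [r_l]$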

Let $L_2$ be a positive integer smaller than $L_1$, and let $\varphi$ be the tensor squeezing operator that merges groups
of $2^{L_2-1}$ modes. Define $r := \min\{r_0, \ldots, r_{L_2-1}, M\}$. With $[\cdot]$ being the matricization operator defined
in the beginning of the appendix, our task is to prove that $\mathrm{rank}[\varphi(\mathcal{A})] \ge r^{2^{L-L_2}}$ almost everywhere w.r.t.
$\{a^{l,j,\gamma}\}_{l,j,\gamma}$. We also consider the case of shared parameters - $a^{l,j,\gamma} \equiv a^{l,\gamma}$, where we would like to show
that the same condition holds almost everywhere w.r.t. $\{a^{l,\gamma}\}_{l,\gamma}$.

Our strategy for proving the claim is inductive. We show that for $l = L_2 \ldots L_1 - 1$, almost everywhere
it holds that for all $j$ and all $\gamma$: $\mathrm{rank}[\varphi(\phi^{l,j,\gamma})] \ge r^{2^{l-L_2}}$. We then treat the special case of $l = L_1$, showing
that indeed $\mathrm{rank}[\varphi(\mathcal{A})] \ge r^{2^{L-L_2}}$. We begin with the setting of unshared parameters ($a^{l,j,\gamma}$), and afterwards
attend the scenario of shared parameters ($a^{l,\gamma}$) as well.

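Our first task is to treat the case $l = L_2$, i.e. show that $\mathrm{rank}[\varphi(\phi^{L_2,j,\gamma})] \ge r$ almost everywhere jointly
for all $j$ and all $\gamma$ (there is actually no need for the matricization $[\cdot]$ here, as $\varphi(\phi^{L_2,j,\gamma})$ are already matrices).
Since a union of finitely many zero measure sets has zero measure, it suffices to show that this condition
holds almost everywhere when specific $j$ and $\gamma$ are chosen. Denote by $e_i$ a vector holding 1 in entry $i$ and 0
elsewhere, by $\mathbf{0}$ a vector of zeros, and by $\mathbf{1}$ a vector of ones. Suppose that for every $j$ we assign $a^{0,j,\alpha}$ to be
$e_\alpha$ when $\alpha \le r$ and $\mathbf{0}$ otherwise. Suppose also that for all $1 \le l \le L_2 - 1$ and all $j$ we set $a^{l,j,\gamma}$ to be $e_\gamma$
when $\gamma \le r$ and $\mathbf{0}$ otherwise. Finally, assume we set $a^{L_2,j,\gamma} = \mathbf{1}$ for all $j$ and all $\gamma$. These settings imply that
for every $j$, when $\gamma \le r$ we have $\phi^{L_2-1,j,\gamma} = e_\gamma \otimes \cdots \otimes e_\gamma$ ($2^{L_2-1}$ times), i.e. the tensor holds 1 in location
$(\gamma, \ldots, \gamma)$ and 0 elsewhere. If $\gamma > r$ then $\phi^{L_2-1,j,\gamma}$ is the zero tensor. We conclude from this that there are
indices $1 \le i_1 < \ldots < i_r \le M^{2^{L_2-1}}$ such that $\varphi(\phi^{L_2-1,j,\gamma}) = e_{i_\gamma}$ for $\gamma \le r$, and that for $\gamma > r$ we have
$\varphi(\phi^{L_2-1,j,\gamma}) = \mathbf{0}$. We may thus write:

$$
\varphi(\phi^{L_2,j,\gamma})
= \varphi \left( \sum_{\alpha=1}^{r_{L_2-1}} \phi^{L_2-1,2j-1,\alpha} \otimes \phi^{L_2-1,2j,\alpha} \right)
= \sum_{\alpha=1}^{r_{L_2-1}} \varphi(\phi^{L_2-1,2j-1,\alpha}) \otimes \varphi(\phi^{L_2-1,2j,\alpha})
= \sum_{\alpha=1}^{r} e_{i_\alpha} e_{i_\alpha}^\top
$$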

Now, since $i_1 \ldots i_r$ are different from each other, the matrix $\varphi(\phi^{L_2,j,\gamma})$ has rank $r$. This however does not prove our inductive hypothesis for $l = L_2$. We merely showed a specific parameter assignment for which it holds, and we need to show that it is met almost everywhere. To do so, we consider an $r \times r$ sub-matrix of $\varphi(\phi^{L_2,j,\gamma})$ which is non-singular under the specific parameter assignment we defined. The determinant of this sub-matrix is a polynomial in the elements of $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$ which we know does not vanish with the specific assignments defined. Thus, this polynomial vanishes only on a subset of $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$ having zero measure (see Caron and Traynor (2005)). That is to say, the sub-matrix of $\varphi(\phi^{L_2,j,\gamma})$ has rank $r$ almost everywhere, and thus $\varphi(\phi^{L_2,j,\gamma})$ has rank at least $r$ almost everywhere. This completes our treatment of the case $l = L_2$.

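We now turn to prove the propagation of our inductive hypothesis. Let $l \in \{L_2 + 1, \ldots, L_1 - 1\}$, and assume that our inductive hypothesis holds for $l - 1$. Specifically, assume that almost everywhere w.r.t. $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$ we have that $\mathrm{rank}[\varphi(\phi^{l-1,j,\gamma})] \geq r^{2^{l-1-L_2}}$ jointly for all $j \in [N/2^{l-1}]$ and all $\gamma \in [r_{l-1}]$. We would like to show that almost everywhere, $\mathrm{rank}[\varphi(\phi^{l,j,\gamma})] \geq r^{2^{l-L_2}}$ jointly for all $j \in [N/2^l]$ and all $\gamma \in [r_l]$. Again, the fact that a finite union of zero measure sets has zero measure implies that we may prove the condition for specific $j \in [N/2^l]$ and $\gamma \in [r_l]$. Applying the squeezing operator $\varphi$ followed by matricization $[\cdot]$ to the recursive expression for $\phi^{l,j,\gamma}$, we get:

$$[\varphi(\phi^{l,j,\gamma})] = \left[\varphi\left(\sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma}\, \phi^{l-1,2j-1,\alpha} \otimes \phi^{l-1,2j,\alpha}\right)\right] = \sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma} \left[\varphi(\phi^{l-1,2j-1,\alpha}) \otimes \varphi(\phi^{l-1,2j,\alpha})\right] = \sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma}\, [\varphi(\phi^{l-1,2j-1,\alpha})] \odot [\varphi(\phi^{l-1,2j,\alpha})]$$
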


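For $\alpha = 1 \ldots r_{l-1}$, denote the matrix $[\varphi(\phi^{l-1,2j-1,\alpha})] \odot [\varphi(\phi^{l-1,2j,\alpha})]$ by $M_\alpha$. The fact that the Kronecker product multiplies ranks, along with our inductive assumption, implies that almost everywhere $\mathrm{rank}(M_\alpha) \geq r^{2^{l-1-L_2}} \cdot r^{2^{l-1-L_2}} = r^{2^{l-L_2}}$. Noting that the matrices $M_\alpha$ do not depend on $\mathbf{a}^{l,j,\gamma}$, we apply lemma 6 and conclude that almost everywhere $\mathrm{rank}[\varphi(\phi^{l,j,\gamma})] \geq r^{2^{l-L_2}}$, which completes the proof of the inductive propagation.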
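The multiplicative rank property of the Kronecker product used above can be checked numerically on small matrices. The following is a minimal sketch in plain Python; the helper names `kron` and `rank` and the matrices `A` and `B` are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def rank(m):
    """Matrix rank via exact Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in m]
    r = 0  # number of pivots found so far
    for c in range(len(rows[0])):
        # Look for a non-zero entry in column c at or below row r.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate column c from all rows below the pivot row.
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


# Hypothetical example matrices (not taken from the paper).
A = [[1, 2], [2, 4]]                    # rank 1
B = [[1, 0, 1], [0, 1, 1], [1, 1, 2]]   # rank 2 (third row = first + second)
K = kron(A, B)                          # 6 x 6 matrix

print(rank(A), rank(B), rank(K))
```

Exact rational arithmetic is used here so that the rank is not affected by floating-point round-off; in the proof, the same property is what turns the per-factor rank bounds into a bound on $\mathrm{rank}(M_\alpha)$.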

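Next, we treat the special case $l = L_1$. We assume now that almost everywhere $\mathrm{rank}[\varphi(\phi^{L_1-1,j,\gamma})] \geq r^{2^{L_1-1-L_2}}$ jointly for all $j$ and all $\gamma$. Again, we apply the squeezing operator $\varphi$ followed by matricization $[\cdot]$, this time to both sides of the expression for $\mathcal{A}$:

$$[\varphi(\mathcal{A})] = \sum_{\alpha=1}^{r_{L_1-1}} a_\alpha^{L_1} \bigodot_{j=1}^{2^{L-L_1+1}} [\varphi(\phi^{L_1-1,j,\alpha})]$$

As before, denote $M_\alpha := \bigodot_{j=1}^{2^{L-L_1+1}} [\varphi(\phi^{L_1-1,j,\alpha})]$ for $\alpha = 1 \ldots r_{L_1-1}$. Using again the multiplicative rank property of the Kronecker product along with our inductive assumption, we get that almost everywhere $\mathrm{rank}(M_\alpha) \geq \prod_{j=1}^{2^{L-L_1+1}} r^{2^{L_1-1-L_2}} = r^{2^{L-L_2}}$. Noticing that $\{M_\alpha\}_{\alpha \in [r_{L_1-1}]}$ do not depend on $\mathbf{a}^{L_1}$, we apply lemma 6 for the last time and get that almost everywhere (w.r.t. $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$), the rank of $[\varphi(\mathcal{A})]$ is at least $r^{2^{L-L_2}}$. This completes our proof in the case of unshared parameters.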

Proving the theorem in the case of shared parameters ($\mathbf{a}^{l,j,\gamma} \equiv \mathbf{a}^{l,\gamma}$) can be done in the exact same way as above. In fact, all one has to do is omit the references to $j$ and the proof will apply. Notice in particular that the specific parameter assignment we defined to handle $l = L_2$ was completely symmetric, i.e. it did not include any dependence on $j$. ■


B.2. Proof of Corollaries 2 and 4 

Corollaries 2 and 4 are a direct continuation of thm. 1 and 3 respectively. In the theorems, we have shown that almost all coefficient tensors generated by a deep network cannot be realized by a shallow network if the latter does not meet a certain minimal size requirement. The corollaries take this further, by stating that given linearly independent representation functions $f_{\theta_1} \ldots f_{\theta_M}$, not only is efficient realization of coefficient tensors generally impossible, but also efficient approximation of score functions. To prove this extra step, we recall from the proofs of thm. 1 and 3 (app. B.1) that in order to show separation between the coefficient tensor of a deep network and that of a shallow network, we relied on matricization rank. Specifically, we derived constants $R_D, R_S \in \mathbb{N}$, $R_D > R_S$, such that the matricization of a deep network's coefficient tensor had rank greater or equal to $R_D$, whereas the matricization of a shallow network's coefficient tensor had rank smaller or equal to $R_S$. Given this observation, corollaries 2 and 4 readily follow from lemma 7 below (the lemma relies on basic concepts and results from the topic of $L^2$ Hilbert spaces - see app. C.1 for a brief discussion on the matter).

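Lemma 7 Let $f_{\theta_1} \ldots f_{\theta_M} \in L^2(\mathbb{R}^s)$ be a set of linearly independent functions, and denote by $\mathcal{T}$ the (Euclidean) space of tensors with order $N$ and dimension $M$ in each mode. For a given tensor $\mathcal{A} \in \mathcal{T}$, denote by $h(\mathcal{A})$ the function in $L^2((\mathbb{R}^s)^N)$ defined by:

$$(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \sum_{d_1, \ldots, d_N = 1}^{M} \mathcal{A}_{d_1, \ldots, d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i)$$

Let $\{\mathcal{A}^\lambda\}_{\lambda \in \Lambda} \subset \mathcal{T}$ be a family of tensors, and $\mathcal{A}^*$ be a certain target tensor that lies outside the family. Assume that for all $\lambda \in \Lambda$ we have $\mathrm{rank}([\mathcal{A}^\lambda]) < \mathrm{rank}([\mathcal{A}^*])$, where $[\cdot]$ is the matricization operator defined in app. B.1. Then, the distance in $L^2((\mathbb{R}^s)^N)$ between $h(\mathcal{A}^*)$ and $\{h(\mathcal{A}^\lambda)\}_{\lambda \in \Lambda}$ is strictly positive, i.e. there exists an $\epsilon > 0$ such that:

$$\forall \lambda \in \Lambda : \int \left| h(\mathcal{A}^\lambda) - h(\mathcal{A}^*) \right|^2 > \epsilon$$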


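Proof The fact that $\{f_{\theta_d}(\mathbf{x})\}_{d \in [M]}$ are linearly independent in $L^2(\mathbb{R}^s)$ implies that the product functions $\{\prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i)\}_{d_1 \ldots d_N \in [M]}$ are linearly independent in $L^2((\mathbb{R}^s)^N)$ (see app. C.1). Let $(h^{(t)})_{t=1}^{\infty}$ be a sequence of functions that lie in the span of $\{\prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i)\}_{d_1 \ldots d_N \in [M]}$, and for every $t \in \mathbb{N}$ denote by $\mathcal{A}^{(t)}$ the coefficient tensor of $h^{(t)}$ under this basis, i.e. $\mathcal{A}^{(t)} \in \mathcal{T}$ is defined by:

$$h^{(t)}(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{d_1, \ldots, d_N = 1}^{M} \mathcal{A}^{(t)}_{d_1, \ldots, d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i)$$

Assume that $(h^{(t)})_{t=1}^{\infty}$ converges to $h(\mathcal{A}^*)$ in $L^2((\mathbb{R}^s)^N)$:

$$\lim_{t \to \infty} \int \left| h^{(t)} - h(\mathcal{A}^*) \right|^2 = 0$$

In a finite-dimensional Hilbert space, convergence in norm implies convergence in representation coefficients under any preselected basis. We thus have:

$$\forall d_1 \ldots d_N \in [M] : \mathcal{A}^{(t)}_{d_1, \ldots, d_N} \xrightarrow{t \to \infty} \mathcal{A}^*_{d_1, \ldots, d_N}$$

This means in particular that in the tensor space $\mathcal{T}$, $\mathcal{A}^*$ lies in the closure of $\{\mathcal{A}^{(t)}\}_{t=1}^{\infty}$. Accordingly, in order to show that the distance in $L^2((\mathbb{R}^s)^N)$ between $h(\mathcal{A}^*)$ and $\{h(\mathcal{A}^\lambda)\}_{\lambda \in \Lambda}$ is strictly positive, it suffices to show that the distance in $\mathcal{T}$ between $\mathcal{A}^*$ and $\{\mathcal{A}^\lambda\}_{\lambda \in \Lambda}$ is strictly positive, or equivalently, that the distance between the matrix $[\mathcal{A}^*]$ and the family of matrices $\{[\mathcal{A}^\lambda]\}_{\lambda \in \Lambda}$ is strictly positive. This however is a direct implication of the assumption $\forall \lambda \in \Lambda : \mathrm{rank}([\mathcal{A}^\lambda]) < \mathrm{rank}([\mathcal{A}^*])$, since the set of matrices whose rank is strictly smaller than $\mathrm{rank}([\mathcal{A}^*])$ is closed and does not contain $[\mathcal{A}^*]$. ■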


Appendix C. Derivation of Hypotheses Space 

In order to keep the body of the paper at a reasonable length, the presentation of our hypotheses space (eq. 2) in sec. 3 did not provide the grounds for its definition. In this appendix we derive the hypotheses space step by step. After establishing basic preliminaries on the topic of $L^2$ spaces, we utilize the notion of tensor products between such spaces to reach a universal representation as in eq. 2 but with $M \to \infty$. We then make use of empirical studies characterizing the statistics of natural images, to argue that in practice a moderate value of $M$ ($M \in \Omega(100)$) suffices.

C.1. Preliminaries on $L^2$ Spaces

When dealing with functions over scalars, vectors or collections of vectors, we consider $L^2$ spaces, or more formally, the Hilbert spaces of Lebesgue measurable square-integrable real functions equipped with standard (point-wise) addition and scalar multiplication, as well as the inner-product defined by integral over point-wise multiplication. The topic of $L^2$ function spaces lies at the heart of functional analysis, and requires basic knowledge in measure theory. We present here the bare necessities required to follow this appendix, referring the interested reader to Rudin (1991) for a more comprehensive introduction.

For our purposes, it suffices to view an $L^2$ space as a vector space of all functions $f$ satisfying $\int f^2 < \infty$. This vector space is infinite dimensional, and a set of functions $\mathcal{F} \subset L^2$ is referred to as total if the closure of its span covers the entire space, i.e. if for any function $g \in L^2$ and $\epsilon > 0$, there exist functions $f_1 \ldots f_K \in \mathcal{F}$ and coefficients $c_1 \ldots c_K \in \mathbb{R}$ such that $\int \left| \sum_{i=1}^{K} c_i \cdot f_i - g \right|^2 < \epsilon$. $\mathcal{F}$ is regarded as linearly independent if all of its finite subsets are linearly independent, i.e. for any $f_1 \ldots f_K \in \mathcal{F}$, $f_i \neq f_j$, and $c_1 \ldots c_K \in \mathbb{R}$, if $\sum_{i=1}^{K} c_i \cdot f_i = 0$ then $c_1 = \cdots = c_K = 0$. A non-trivial result states that $L^2$ spaces in general must contain total and linearly independent sets, and moreover, for any $s \in \mathbb{N}$, $L^2(\mathbb{R}^s)$ contains a countable set of this type. It seems reasonable to draw an analogy between total and linearly independent sets in $L^2$ space, and bases in a finite dimensional vector space. While this analogy is indeed appropriate from our perspective, total and linearly independent sets are not to be confused with bases for $L^2$ spaces, which are typically defined to be orthonormal.

It can be shown (see for example Hackbusch (2012)) that for any natural numbers $s$ and $N$, if $\{f_d(\mathbf{x})\}_{d \in \mathbb{N}}$ is a total or a linearly independent set in $L^2(\mathbb{R}^s)$, then $\{(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \prod_{i=1}^{N} f_{d_i}(\mathbf{x}_i)\}_{d_1 \ldots d_N \in \mathbb{N}}$, the induced point-wise product functions on $(\mathbb{R}^s)^N$, form a set which is total or linearly independent, respectively, in $L^2((\mathbb{R}^s)^N)$. As we now briefly outline, this result actually emerges from a deep relation between tensor products and Hilbert spaces. The definitions given in sec. 2 for a tensor, tensor space, and tensor product, are actually concrete special cases of much deeper, abstract algebraic concepts. A more formal line of presentation considers multiple vector spaces $V_1 \ldots V_N$, and defines their tensor product space $V_1 \otimes \cdots \otimes V_N$ to be a specific quotient space of the space freely generated by their Cartesian product set. For every combination of vectors $\mathbf{v}^{(i)} \in V_i$, $i \in [N]$, there exists a corresponding element $\mathbf{v}^{(1)} \otimes \cdots \otimes \mathbf{v}^{(N)}$ in the tensor product space, and moreover, elements of this form span the entire space. If $V_1 \ldots V_N$ are Hilbert spaces, it is possible to equip $V_1 \otimes \cdots \otimes V_N$ with a natural inner-product operation, thereby turning it too into a Hilbert space. It may then be shown that if the sets $\{\mathbf{v}^{(i)}_\alpha\}_\alpha \subset V_i$, $i \in [N]$, are total or linearly independent, elements of the form $\mathbf{v}^{(1)}_{\alpha_1} \otimes \cdots \otimes \mathbf{v}^{(N)}_{\alpha_N}$ are total or linearly independent, respectively, in $V_1 \otimes \cdots \otimes V_N$. Finally, when the underlying Hilbert spaces are $L^2(\mathbb{R}^s)$, the point-wise product mapping $f_1(\mathbf{x}) \otimes \cdots \otimes f_N(\mathbf{x}) \mapsto \prod_{i=1}^{N} f_i(\mathbf{x}_i)$ from the tensor product space $(L^2(\mathbb{R}^s))^{\otimes N} := L^2(\mathbb{R}^s) \otimes \cdots \otimes L^2(\mathbb{R}^s)$ to $L^2((\mathbb{R}^s)^N)$, induces an isomorphism of Hilbert spaces.

C.2. Construction 

Recall from sec. 3 that our instance space is defined as $\mathcal{X} := (\mathbb{R}^s)^N$, in accordance with the common practice of representing natural data through ordered local structures (for example images are often represented through small patches around their pixels). We classify instances into categories $\mathcal{Y} := \{1 \ldots Y\}$ via maximization of per-label score functions $\{h_y : (\mathbb{R}^s)^N \to \mathbb{R}\}_{y \in \mathcal{Y}}$. Our hypotheses space $\mathcal{H}$ is defined to be the subset of $L^2((\mathbb{R}^s)^N)$ from which score functions may be taken.

In app. C.1 we stated that if $\{f_d(\mathbf{x})\}_{d \in \mathbb{N}}$ is a total set in $L^2(\mathbb{R}^s)$, i.e. if every function in $L^2(\mathbb{R}^s)$ can be arbitrarily well approximated by a linear combination of a finite subset of $\{f_d(\mathbf{x})\}_{d \in \mathbb{N}}$, then the point-wise products $\{(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \prod_{i=1}^{N} f_{d_i}(\mathbf{x}_i)\}_{d_1, \ldots, d_N \in \mathbb{N}}$ form a total set in $L^2((\mathbb{R}^s)^N)$. Accordingly, in a universal hypotheses space $\mathcal{H} = L^2((\mathbb{R}^s)^N)$, any score function $h_y$ may be arbitrarily well approximated by finite linear combinations of such point-wise products. A possible formulation of this would be as follows. Assume we are interested in $\epsilon$-approximation of the score function $h_y$, and consider a formal tensor $\mathcal{A}^y$ having $N$ modes and a countable infinite dimension in each mode $i \in [N]$, indexed by $d_i \in \mathbb{N}$. Then, there exists such a tensor, with all but a finite number of entries set to zero, for which:

$$h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N) \approx \sum_{d_1 \ldots d_N \in \mathbb{N}} \mathcal{A}^y_{d_1, \ldots, d_N} \prod_{i=1}^{N} f_{d_i}(\mathbf{x}_i) \qquad (6)$$

Given that the set of functions $\{f_d(\mathbf{x})\}_{d \in \mathbb{N}} \subset L^2(\mathbb{R}^s)$ is total, eq. 6 defines a universal hypotheses space. There are many possibilities for choosing a total set of functions. Wavelets are perhaps the most obvious choice, and were indeed used in a deep network setting by Bruna and Mallat (2012). The special case of Gabor wavelets has been claimed to induce features that resemble representations in the visual cortex (Serre et al. (2005)). Two options we pay special attention to due to their importance in practice are:

• Gaussians (with diagonal covariance):

$$f_\theta(\mathbf{x}) = \mathcal{N}\left(\mathbf{x}; \boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2)\right) \qquad (7)$$

where $\theta = (\boldsymbol{\mu} \in \mathbb{R}^s, \boldsymbol{\sigma}^2 \in \mathbb{R}^s_{++})$.

• Neurons:

$$f_\theta(\mathbf{x}) = \sigma\left(\mathbf{x}^\top \mathbf{w} + b\right) \qquad (8)$$

where $\theta = (\mathbf{w} \in \mathbb{R}^s, b \in \mathbb{R})$ and $\sigma$ is a point-wise non-linear activation such as threshold $\sigma(z) = \mathbb{1}[z > 0]$, rectified linear unit (ReLU) $\sigma(z) = \max\{z, 0\}$ or sigmoid $\sigma(z) = 1/(1 + e^{-z})$.
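The two families of representation functions in eq. 7 and eq. 8 can be written down directly. The following is a minimal sketch in plain Python, assuming inputs are given as lists of floats; the function names `gaussian` and `neuron` and the `activation` argument are illustrative choices, not part of the paper.

```python
import math


def gaussian(x, mu, var):
    """Diagonal-covariance Gaussian density N(x; mu, diag(var)) as in eq. 7."""
    log_density = 0.0
    for x_k, mu_k, var_k in zip(x, mu, var):
        # Each coordinate contributes an independent 1-D Gaussian factor.
        log_density += -0.5 * math.log(2.0 * math.pi * var_k)
        log_density += -((x_k - mu_k) ** 2) / (2.0 * var_k)
    return math.exp(log_density)


def neuron(x, w, b, activation="relu"):
    """Neuron sigma(x^T w + b) as in eq. 8, with a selectable activation."""
    z = sum(x_k * w_k for x_k, w_k in zip(x, w)) + b
    if activation == "threshold":
        return 1.0 if z > 0 else 0.0
    if activation == "relu":
        return max(z, 0.0)
    if activation == "sigmoid":
        return 1.0 / (1.0 + math.exp(-z))
    raise ValueError("unknown activation: " + activation)


# Hypothetical inputs with s = 2 (not taken from the paper).
g = gaussian([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
n_relu = neuron([1.0, 2.0], [1.0, -1.0], 0.5, "relu")
n_thr = neuron([1.0, 2.0], [1.0, -1.0], 0.5, "threshold")
n_sig = neuron([0.0], [1.0], 0.0, "sigmoid")
```

Both functions return non-negative values for every input, which is the property later relied upon for computation in log-space.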

In both cases, there is an underlying parametric family of functions $\mathcal{F} = \{f_\theta : \mathbb{R}^s \to \mathbb{R}\}_{\theta \in \Theta}$ of which a countable total subset may be chosen. The fact that Gaussians as above are total in $L^2(\mathbb{R}^s)$ has been proven in Girosi and Poggio (1990), and is a direct corollary of the Stone-Weierstrass theorem. To achieve countability, simply consider Gaussians with rational parameters (mean and variances). In practice, the choice of Gaussians (with diagonal covariance) gives rise to a "similarity" operator as described by the SimNet architecture (Cohen and Shashua (2014); Cohen et al. (2016)). For the case of neurons we must restrict the domain $\mathbb{R}^s$ to some bounded set, otherwise the functions are not integrable. This however is not a limitation in practice, and indeed neurons are widely used across many application domains. The fact that neurons are total has been proven in Cybenko (1989) and Hornik et al. (1989) for threshold and sigmoid activations. More generally, it has been proven in Stinchcombe and White (1989) for a wide class of activation functions, including linear combinations of ReLU. See Pinkus (1999) for a survey of such results. For countability, we may again restrict parameters (weights and bias) to be rational.

In the case of Gaussians and neurons, we argue that a finite set of functions suffices, i.e. that it is possible to choose $f_{\theta_1} \ldots f_{\theta_M} \in \mathcal{F}$ that will suffice in order to represent score functions required for natural tasks. Moreover, we claim that $M$ need not be large (e.g. on the order of 100). Our argument relies on statistical properties of natural images, and is fully detailed in app. C.3. It implies that under proper choice of $\{f_{\theta_d}(\mathbf{x})\}_{d \in [M]}$, the finite set of point-wise product functions $\{(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i)\}_{d_1, \ldots, d_N \in [M]}$ spans the score functions of interest, and we may define for each label $y$ a tensor $\mathcal{A}^y$ of order $N$ and dimension $M$ in each mode, such that:

$$h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{d_1, \ldots, d_N = 1}^{M} \mathcal{A}^y_{d_1, \ldots, d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i) \qquad (2)$$

which is exactly the hypotheses space presented in sec. 3. Notice that if $\{f_{\theta_d}(\mathbf{x})\}_{d \in [M]} \subset L^2(\mathbb{R}^s)$ are linearly independent (there is no reason to choose them otherwise), then so are the product functions $\{(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i)\}_{d_1, \ldots, d_N \in [M]} \subset L^2((\mathbb{R}^s)^N)$ (see app. C.1), and a score function $h_y$ uniquely determines the coefficient tensor $\mathcal{A}^y$. In other words, two score functions $h_{y,1}$ and $h_{y,2}$ are identical if and only if their coefficient tensors $\mathcal{A}^{y,1}$ and $\mathcal{A}^{y,2}$ are the same.
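To make the form of eq. 2 concrete, the following minimal sketch evaluates a score function by brute-force summation over all index tuples $(d_1, \ldots, d_N)$. It is written in plain Python under several assumptions that are not in the paper: the coefficient tensor is stored as a dictionary keyed by index tuples, patches are scalars ($s = 1$), and the two representation functions are toy choices.

```python
import itertools
import math


def score(coeffs, rep_funcs, patches):
    """Evaluate h(x_1..x_N) = sum_d coeffs[d] * prod_i f_{d_i}(x_i), cf. eq. 2.

    coeffs    : dict mapping index tuples (d_1, ..., d_N) to coefficients
    rep_funcs : list of M representation functions f_1, ..., f_M
    patches   : list of N inputs x_1, ..., x_N
    """
    num_funcs, num_patches = len(rep_funcs), len(patches)
    # Precompute the representation values f_d(x_i) once per patch.
    rep = [[f(x) for f in rep_funcs] for x in patches]
    total = 0.0
    for idx in itertools.product(range(num_funcs), repeat=num_patches):
        total += coeffs[idx] * math.prod(rep[i][d] for i, d in enumerate(idx))
    return total


# Hypothetical toy setting: M = 2 functions, N = 2 scalar patches.
rep_funcs = [lambda x: 1.0, lambda x: x]
coeffs = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
value = score(coeffs, rep_funcs, [2.0, 3.0])  # 1 + 2*3 + 3*2 + 4*(2*3)
```

The loop visits $M^N$ index tuples, so this direct evaluation is only feasible for tiny $M$ and $N$; this is the cost that factorizing the coefficient tensor is meant to avoid.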

C.3. Finite Function Bases for Classification of Natural Data 

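In app. C.2 we laid out the framework of classifying instances in the space $\mathcal{X} := \{(\mathbf{x}_1, \ldots, \mathbf{x}_N) : \mathbf{x}_i \in \mathbb{R}^s\} = (\mathbb{R}^s)^N$ into labels $\mathcal{Y} := \{1, \ldots, Y\}$ via maximization of per-label score functions $h_y : \mathcal{X} \to \mathbb{R}$:

$$\hat{y}(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \operatorname{argmax}_{y \in \mathcal{Y}} h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N)$$

where $h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ is of the form:

$$h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{d_1, \ldots, d_N = 1}^{M} \mathcal{A}^y_{d_1, \ldots, d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i) \qquad (2)$$

and $\{f_{\theta_d}\}_{d \in [M]}$ are selected from a parametric family of functions $\mathcal{F} = \{f_\theta : \mathbb{R}^s \to \mathbb{R}\}_{\theta \in \Theta}$. For universality, i.e. for the ability of score functions $h_y$ to approximate any function in $L^2(\mathcal{X})$ as $M \to \infty$, we required that it be possible to choose a countable subset of $\mathcal{F}$ that is total in $L^2(\mathbb{R}^s)$. We noted that the families of Gaussians (eq. 7) and neurons (eq. 8) meet this requirement.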

In this subsection we formalize our argument that a finite value for $M$ is sufficient when $X$ represents natural data, and in particular, natural images. Based on empirical studies characterizing the statistical properties of natural images, and in compliance with the number of channels in a typical convolutional network layer, we find that $M$ on the order of 100 typically suffices.

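Let $\mathcal{D}$ be a distribution of labeled instances $(X, \bar{y})$ over $\mathcal{X} \times \mathcal{Y}$ (we use bar notation to distinguish the label $\bar{y}$ from the running index $y$), and $\mathcal{D}_X$ be the induced marginal distribution of instances $X$ over $\mathcal{X}$. We would like to show, given particular assumptions on $\mathcal{D}$, that there exist functions $f_{\theta_1}, \ldots, f_{\theta_M} \in \mathcal{F}$ and tensors $\mathcal{A}^1, \ldots, \mathcal{A}^Y$ of order $N$ and dimension $M$ in each mode, such that the score functions $h_y$ defined in eq. 2 achieve low classification error:

$$L^{0\text{-}1}_{\mathcal{D}}(h_1, \ldots, h_Y) := \mathbb{E}_{(X, \bar{y}) \sim \mathcal{D}}\left[ \mathbb{1}\left[ \bar{y} \neq \operatorname{argmax}_{y \in \mathcal{Y}} h_y(X) \right] \right] \qquad (9)$$

$\mathbb{1}[\cdot]$ here stands for the indicator function, taking the value 1 when its argument is true, and 0 otherwise.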

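Let $\{h^*_y\}_{y \in \mathcal{Y}}$ be a set of "ground truth" score functions for which optimal prediction is achieved, or more specifically, for which the expected hinge-loss (which upper bounds the 0-1 loss) is minimal:

$$(h^*_1, \ldots, h^*_Y) = \operatorname{argmin}_{h'_1, \ldots, h'_Y : \mathcal{X} \to \mathbb{R}} L^{hinge}_{\mathcal{D}}(h'_1, \ldots, h'_Y)$$

where:

$$L^{hinge}_{\mathcal{D}}(h'_1, \ldots, h'_Y) := \mathbb{E}_{(X, \bar{y}) \sim \mathcal{D}}\left[ \max_{y \in \mathcal{Y}} \left\{ \mathbb{1}[y \neq \bar{y}] + h'_y(X) \right\} - h'_{\bar{y}}(X) \right] \qquad (10)$$

Our strategy will be to select score functions $h_y$ of the format given in eq. 2, that approximate $h^*_y$ in the sense of low expected maximal absolute difference:

$$\mathcal{E} := \mathbb{E}_{X \sim \mathcal{D}_X}\left[ \max_{y \in \mathcal{Y}} \left| h_y(X) - h^*_y(X) \right| \right] \qquad (11)$$
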

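We refer to $\mathcal{E}$ as the score approximation error obtained by $h_y$. The 0-1 loss of $h_y$ with respect to the labeled example $(X, \bar{y}) \in \mathcal{X} \times \mathcal{Y}$ is bounded as follows:

$$\begin{aligned}
\mathbb{1}\left[ \bar{y} \neq \operatorname{argmax}_{y \in \mathcal{Y}} h_y(X) \right]
&\leq \max_{y \in \mathcal{Y}} \left\{ \mathbb{1}[y \neq \bar{y}] + h_y(X) \right\} - h_{\bar{y}}(X) \\
&= \max_{y \in \mathcal{Y}} \left\{ \mathbb{1}[y \neq \bar{y}] + h^*_y(X) + h_y(X) - h^*_y(X) \right\} - h^*_{\bar{y}}(X) + h^*_{\bar{y}}(X) - h_{\bar{y}}(X) \\
&\leq \max_{y \in \mathcal{Y}} \left\{ \mathbb{1}[y \neq \bar{y}] + h^*_y(X) \right\} - h^*_{\bar{y}}(X) + \max_{y \in \mathcal{Y}} \left\{ h_y(X) - h^*_y(X) \right\} + h^*_{\bar{y}}(X) - h_{\bar{y}}(X) \\
&\leq \max_{y \in \mathcal{Y}} \left\{ \mathbb{1}[y \neq \bar{y}] + h^*_y(X) \right\} - h^*_{\bar{y}}(X) + 2 \max_{y \in \mathcal{Y}} \left| h_y(X) - h^*_y(X) \right|
\end{aligned}$$

Taking expectation of the first and last terms above with respect to $(X, \bar{y}) \sim \mathcal{D}$, and recalling the definitions given in eq. 9, 10 and 11, we get:

$$L^{0\text{-}1}_{\mathcal{D}}(h_1, \ldots, h_Y) \leq L^{hinge}_{\mathcal{D}}(h^*_1, \ldots, h^*_Y) + 2\mathcal{E}$$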
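The per-example chain of inequalities above can be sanity-checked numerically. The following minimal sketch, in plain Python, draws random reference scores (standing in for $h^*_y$) and perturbed scores (standing in for $h_y$) and counts how often the 0-1 loss of the perturbed scores exceeds the hinge-loss of the reference scores plus twice the maximal absolute difference. The sampling ranges, the number of labels and the helper names are arbitrary assumptions made for illustration.

```python
import random


def zero_one_loss(scores, label):
    """1 if the arg-max of the scores differs from the true label, else 0."""
    predicted = max(range(len(scores)), key=lambda y: scores[y])
    return 0.0 if predicted == label else 1.0


def hinge_loss(scores, label):
    """Multiclass hinge: max_y {1[y != label] + scores[y]} - scores[label]."""
    margin = max((0.0 if y == label else 1.0) + s for y, s in enumerate(scores))
    return margin - scores[label]


rng = random.Random(0)
num_labels = 5
violations = 0
for _ in range(1000):
    h_star = [rng.uniform(-2.0, 2.0) for _ in range(num_labels)]  # reference scores
    h = [s + rng.uniform(-0.5, 0.5) for s in h_star]              # perturbed scores
    label = rng.randrange(num_labels)
    gap = max(abs(a - b) for a, b in zip(h, h_star))
    bound = hinge_loss(h_star, label) + 2.0 * gap
    # Small tolerance to absorb floating-point round-off.
    if zero_one_loss(h, label) > bound + 1e-12:
        violations += 1

print("violations:", violations)
```

Since the bound holds for every single example, it also holds after taking expectations, which is the step that leads to the inequality between the classification error and the optimal expected hinge-loss.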




In words, the classification error of the score functions $h_y$ is bounded by the optimal expected hinge-loss plus a term equal to twice their score approximation error. Recall that we did not constrain the optimal score functions $h^*_y$ in any way. Thus, assuming a label is deterministic given an instance, the optimal expected hinge-loss is essentially zero, and the classification error of $h_y$ is dominated by their score approximation error $\mathcal{E}$ (eq. 11). Our problem thus translates to showing that $h_y$ can be selected such that $\mathcal{E}$ is small.

At this point we introduce our main assumption on the distribution $\mathcal{D}$, or more specifically, on the marginal distribution of instances $\mathcal{D}_X$. According to various studies, in natural settings, the marginal distribution of individual vectors in $X$, e.g. of small patches in images, may be relatively well captured by a Gaussian Mixture Model (GMM) with a moderate number (on the order of 100 or less) of distinct components. For example, it was shown in Zoran and Weiss (2012) that natural image patches of size 2x2, 4x4, 8x8 or 16x16, can essentially be modeled by GMMs with 64 components (adding more components barely improved the log-likelihood). This complies with the common belief that a moderate number of low-level templates suffices in order to model the vast majority of local image patches. Following this line, we model the marginal distribution of $\mathbf{x}_i$ with a GMM having $M$ components with means $\boldsymbol{\mu}_1 \ldots \boldsymbol{\mu}_M \in \mathbb{R}^s$. We assume that the components are well localized, i.e. that their standard deviations are small compared to the distances between means, and also compared to the variation of the target functions $h^*_y$. In the context of images for example, the latter two assumptions imply that a local patch can be unambiguously assigned to a template, and that the assignment of patches to templates determines the class of an image. Returning to general instances $X$, their probability mass will be concentrated in distinct regions of the space $\mathcal{X}$, in which for every $i \in [N]$, the vector $\mathbf{x}_i$ lies near $\boldsymbol{\mu}_{c_i}$ for some $c_i \in [M]$. The score functions $h^*_y$ are approximately constant in each such region. It is important to stress here that we do not assume statistical independence of the $\mathbf{x}_i$'s, only that their possible values can be quantized into $M$ templates $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_M$.

Under our idealized assumptions on $\mathcal{D}_X$, the expectation in the score approximation error $\mathcal{E}$ can be discretized as follows:

$$\mathcal{E} := \mathbb{E}_{X \sim \mathcal{D}_X}\left[ \max_{y \in \mathcal{Y}} \left| h_y(X) - h^*_y(X) \right| \right] = \sum_{c_1, \ldots, c_N = 1}^{M} P_{c_1, \ldots, c_N} \max_{y \in \mathcal{Y}} \left| h_y(\mathcal{M}_{c_1, \ldots, c_N}) - h^*_y(\mathcal{M}_{c_1, \ldots, c_N}) \right| \qquad (12)$$

where $\mathcal{M}_{c_1, \ldots, c_N} := (\boldsymbol{\mu}_{c_1}, \ldots, \boldsymbol{\mu}_{c_N})$ and $P_{c_1, \ldots, c_N}$ stands for the probability that $\mathbf{x}_i$ lies near $\boldsymbol{\mu}_{c_i}$ for every $i \in [N]$ ($P_{c_1, \ldots, c_N} \geq 0$, $\sum_{c_1, \ldots, c_N} P_{c_1, \ldots, c_N} = 1$).

We now turn to show that $f_{\theta_1} \ldots f_{\theta_M}$ can be chosen to separate GMM components, i.e. such that for every $c, d \in [M]$, $f_{\theta_d}(\boldsymbol{\mu}_c) \neq 0$ if and only if $c = d$. If the functions $f_\theta$ are Gaussians (eq. 7), we can simply set the mean of $f_{\theta_d}$ to $\boldsymbol{\mu}_d$, and its standard deviations to be low enough such that the function effectively vanishes at $\boldsymbol{\mu}_c$ when $c \neq d$. If $f_\theta$ are neurons (eq. 8), an additional requirement is needed, namely that the GMM component means $\boldsymbol{\mu}_1 \ldots \boldsymbol{\mu}_M$ be linearly separable. In other words, we require that for every $d \in [M]$, there exist $\mathbf{w}_d \in \mathbb{R}^s$ and $b_d \in \mathbb{R}$ for which $\mathbf{w}_d^\top \boldsymbol{\mu}_c + b_d$ is positive if $c = d$ and negative otherwise. This may seem like a strict assumption at first glance, but notice that the dimension $s$ is often as large as, or even larger than, the number of components $M$. In addition, if input vectors $\mathbf{x}_i$ are normalized to unit length (a standard practice with image patches for example), $\boldsymbol{\mu}_1 \ldots \boldsymbol{\mu}_M$ will also be normalized, and thus linear separability is trivially met. Assuming we have linear separability, one may set $\theta_d = (\mathbf{w}_d, b_d)$, and for threshold or ReLU activations we indeed get $f_{\theta_d}(\boldsymbol{\mu}_c) \neq 0 \iff c = d$. With sigmoid activations, we may need to scale $(\mathbf{w}_d, b_d)$ so that $\mathbf{w}_d^\top \boldsymbol{\mu}_c + b_d \ll 0$ when $c \neq d$, and that would ensure that in this case $f_{\theta_d}(\boldsymbol{\mu}_c)$ effectively vanishes.
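The remark that unit-length means are trivially linearly separable can be illustrated with one possible construction, which is an assumption of this sketch rather than the paper's prescription: take $\mathbf{w}_d = \boldsymbol{\mu}_d$ and place $-b_d$ halfway between 1 and the largest inner product of $\boldsymbol{\mu}_d$ with any other mean. For distinct unit vectors that inner product is strictly below 1, so the resulting ReLU neuron responds only to its own template. The helper names and the three example means below are hypothetical.

```python
def dot(u, v):
    """Inner product of two vectors given as sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def separating_neuron(means, d):
    """Return (w_d, b_d) with w_d . mu_c + b_d > 0 exactly when c == d.

    Assumes the means are distinct, have unit Euclidean norm, and that
    there are at least two of them.
    """
    mu_d = means[d]
    largest_other = max(dot(mu_d, mu_c)
                        for c, mu_c in enumerate(means) if c != d)
    # mu_d . mu_d = 1 > largest_other, so the midpoint separates them.
    threshold = (1.0 + largest_other) / 2.0
    return mu_d, -threshold


def relu(z):
    return max(z, 0.0)


# Hypothetical unit-norm template means in R^2.
means = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
neurons = [separating_neuron(means, d) for d in range(len(means))]
# responses[d][c] is the ReLU neuron f_{theta_d} evaluated at mu_c.
responses = [[relu(dot(w, mu_c) + b) for mu_c in means] for (w, b) in neurons]
```

With these choices `responses` is non-zero only on its diagonal, i.e. each neuron separates its own GMM component from the others.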

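Assuming we have chosen $f_{\theta_1} \ldots f_{\theta_M}$ to separate GMM components, and plugging-in the format of $h_y$ given in eq. 2, we get the following convenient form for $h_y(\mathcal{M}_{c_1, \ldots, c_N})$:

$$h_y(\mathcal{M}_{c_1, \ldots, c_N}) = \mathcal{A}^y_{c_1, \ldots, c_N} \prod_{i=1}^{N} f_{\theta_{c_i}}(\boldsymbol{\mu}_{c_i})$$

Assigning the coefficient tensors through the following rule:

$$\mathcal{A}^y_{c_1, \ldots, c_N} = \frac{h^*_y(\mathcal{M}_{c_1, \ldots, c_N})}{\prod_{i=1}^{N} f_{\theta_{c_i}}(\boldsymbol{\mu}_{c_i})}$$

implies:

$$h_y(\mathcal{M}_{c_1, \ldots, c_N}) = h^*_y(\mathcal{M}_{c_1, \ldots, c_N})$$

for every $y \in \mathcal{Y}$ and $c_1 \ldots c_N \in [M]$. Plugging this into eq. 12, we get a score approximation error of zero.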

To recap, we have shown that when the parametric functions $f_\theta$ are Gaussians (eq. 7) or neurons (eq. 8), not only are the score functions $h_y$ given in eq. 2 universal when $M \to \infty$ (see app. C.2), but they can also achieve zero classification error (eq. 9) with a moderate value of $M$ (on the order of 100) if the underlying data distribution $\mathcal{D}$ is "natural". In this context, $\mathcal{D}$ is regarded as natural if it satisfies two conditions. The first, which is rather mild, requires that a label be completely determined by the instance. For example, an image will belong to one category with probability one, and to the rest of the categories with probability zero. The second condition, which is far more restrictive, states that input vectors composing an instance can be quantized into a moderate number ($M$) of templates. The assumption that natural images exhibit this property is based on various empirical studies where it is shown to hold approximately. Since it does not hold exactly, our analysis is approximate, and its implication in practice is that the classification error introduced by constraining score functions to have the format given in eq. 2, is negligible compared to other sources of error (factorization of the coefficient tensors, finiteness of training data and difficulty in optimization).


Appendix D. Related Work 

The classic approach for theoretically analyzing the power of depth focused on investigation of the computational complexity of boolean circuits. An early result, known as the "exponential efficiency of depth", may be summarized as follows: for every integer $k$, there are boolean functions that can be computed by a circuit comprising alternating layers of AND and OR gates which has depth $k$ and polynomial size, yet if one limits the depth to $k - 1$ or less, an exponentially large circuit is required. See Sipser (1983) for a formal statement of this classic result. Recently, Rossman et al. (2015) have established a somewhat stronger result, showing cases where not only are polynomially wide shallow boolean circuits incapable of exact realization, but also of approximation (i.e. of agreeing with the target function on more than a specified fraction of input combinations). Other classical results are related to threshold circuits, a class of models more similar to contemporary neural networks than boolean circuits. Namely, they can be viewed as neural networks where each neuron computes a weighted sum of its inputs (possibly including bias), followed by threshold activation ($\sigma(z) = \mathbb{1}[z > 0]$). For threshold circuits, the main known result in our context is the existence of functions that separate depth 3 from depth 2 (see Hajnal et al. (1987) for a statement relating to exact realization, and the techniques in Maass et al. (1994); Martens et al. (2013) for extension to approximation).

More recent studies focus on arithmetic circuits (Shpilka and Yehudayoff (2010)), whose nodes typically compute either a weighted sum or a product of their inputs$^9$ (besides their role in studying expressiveness, deep networks of this class have been shown to support provably optimal training; see Livni et al. (2014)). A special case of this are the Sum-Product Networks (SPNs) presented in Poon and Domingos (2011). SPNs are a class of deep generative models designed to efficiently compute probability density functions. Their summation weights are typically constrained to be non-negative (such an arithmetic circuit is called monotone), and in addition, in order for them to be valid (i.e. to be able to compute probability density functions), additional architectural constraints are needed (e.g. decomposability and completeness). The most widely known theoretical arguments regarding the efficiency of depth in SPNs were given in Delalleau and Bengio (2011). In this work, two specific families of SPNs were considered, both comprising alternating sum and product layers - a family $\mathcal{F}$ whose nodes form a full binary tree, and a family $\mathcal{G}$ with $n$ nodes per layer (excluding the output), each connected to $n - 1$ nodes in the preceding layer. The authors show that functions implemented by these networks require an exponential number of nodes in order to be realized by shallow (single hidden-layer) networks. The limitations of this work are twofold. First, as the authors note themselves, it only analyzes the ability of shallow networks to realize exactly functions generated by deep networks, and does not provide any result relating to approximation. Second, the specific SPN families considered in this work are not universal hypothesis classes and do not resemble networks used in practice. Recently, Martens and Medabalimi (2014) proved that there exist functions which can be efficiently computed by decomposable and complete (D&C) SPNs of depth $d + 1$, yet require a D&C SPN of depth $d$ or less to have super-polynomial size for exact realization. This analysis only treats approximation in the limited case of separating depth 4 from depth 3 (D&C) SPNs. Additionally, it only deals with specific separating functions, and does not convey information regarding how frequent these are. In other words, according to this analysis, it may be that almost all functions generated by deep networks can be efficiently realized by shallow networks, and there are only few pathological functions for which this does not hold. A further limitation of this analysis is that for general $d$, the separation between depths $d + 1$ and $d$ is based on a multilinear circuit result by Raz and Yehudayoff (2009), that translates into a network that once again does not follow the common practices of deep learning.

9. There are different definitions for arithmetic circuits in the literature. We adopt the definition given in Martens and Medabalimi (2014), under which an arithmetic circuit is a directed acyclic graph, where nodes with no incoming edges correspond to inputs, nodes with no outgoing edges correspond to outputs, and the remaining nodes are either labeled as "sum" or as "product". A product node computes the product of its child nodes. A sum node computes a weighted sum of its child nodes, where the weights are parameters linked to its incoming edges.

There have been recent attempts to analyze the efficiency of network depth in other settings as well. The most commonly used type of neural networks these days includes neurons that compute a weighted sum of their inputs (with bias) followed by Rectified Linear Unit (ReLU) activation ($\sigma(z) = \max\{0, z\}$). Pascanu et al. (2013) and Montufar et al. (2014) study the number of linear regions that may be expressed by such networks as a function of their depth and width, thereby showing existence of functions separating deep from shallow (depth 2) networks. Telgarsky (2015) shows a simple construction of a depth $d$ width 2 ReLU network that operates on one-dimensional inputs, realizing a function that cannot be approximated by ReLU networks of depth $o(d / \log d)$ and width polynomial in $d$. Eldan and Shamir (2015) provides functions expressible by ReLU networks of depth 3 and polynomial width, which can only be approximated by a depth 2 network if the latter's width is exponential. Their result applies not only to ReLU activation, but also to the standard sigmoid ($\sigma(z) = 1/(1 + e^{-z})$), and more generally, to any universal activation (see assumption 1 in Eldan and Shamir (2015)). Bianchini and Scarselli (2014) also considers different types of activations, studying the topological complexity (through Betti numbers) of decision regions as a function of network depth, width and activation type. Their results establish the existence of deep vs. shallow separating functions only for the case of polynomial activation. While the above works do address more conventional neural networks, they do not account for the structure of convolutional networks - the most successful deep learning architectures to date, and more importantly, they too prove only existence of some separating functions, without providing any insight as to how frequent these are.

We are not the first to incorporate ideas from the field of tensor analysis into deep learning. Socher et al. (2013), Yu et al. (2012), Setiawan et al. (2015), and Hutchinson et al. (2013) all proposed different neural network architectures that include tensor-based elements, and exhibit various advantages in terms of expressiveness and/or ease of training. In Janzamin et al. (2015), an alternative algorithm for training neural networks is proposed, based on tensor decomposition and Fourier analysis, with proven generalization bounds. In Novikov et al. (2014), Anandkumar et al. (2014), Yang and Dunson (2015) and Song et al. (2013), algorithms for tensor decompositions are used to estimate parameters of different graphical models. Notably, Song et al. (2013) uses the relatively new Hierarchical Tucker decomposition (Hackbusch and Kühn (2009)) that we employ in our work, with certain similarities in the formulations. The works differ considerably in their objectives though: while Song et al. (2013) focuses on the proposal of a new training algorithm, our purpose in this work is to analyze the expressive efficiency of networks and how that depends on depth. Recently, Lebedev et al. (2014) modeled the filters in a convolutional network as four dimensional tensors, and used the CP decomposition to construct an efficient and accurate approximation. Another work that draws a connection between tensor analysis and deep learning is the recent study presented in Haeffele and Vidal (2015). This work shows that with sufficiently large neural networks, no matter how training is initialized, there exists a local optimum that is accessible with gradient descent, and this local optimum is approximately equivalent to the global optimum in terms of objective value.

Appendix E. Computation in Log-Space with SimNets 

A practical issue one faces when implementing arithmetic circuits is the numerical instability of the product 
operation - a product node with a large number of inputs is easily susceptible to numerical overflow or 
underflow. A common solution to this is to perform the computations in log-space, i.e. instead of computing 
activations we compute their log. This requires the activations to be non-negative to begin with, and alters 
the sum and product operations as follows. A product simply turns into a sum, as $\log \prod_i \alpha_i = \sum_i \log \alpha_i$. A 
sum becomes what is known as log-sum-exp or softmax: $\log \sum_i \alpha_i = \log \sum_i \exp(\log \alpha_i)$. 
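
The two identities above can be illustrated with a minimal Python sketch. The function names `log_product` and `log_sum_naive` and the test values are assumptions made for illustration only; they are not part of the networks described in this paper.

```python
import math

def log_product(log_acts):
    """Log of a product node: the sum of the input logs."""
    return sum(log_acts)

def log_sum_naive(log_acts):
    """Log of a sum node, written directly as log-sum-exp (not stabilized)."""
    return math.log(sum(math.exp(v) for v in log_acts))

# A product node with 400 inputs, each equal to 1e-3 (illustrative values).
acts = [1e-3] * 400
direct = math.prod(acts)                                 # underflows in float64
in_log_space = log_product([math.log(a) for a in acts])  # stays finite

print(direct)        # 0.0
print(in_log_space)  # roughly 400 * log(1e-3)
```

The direct product is lost to underflow, while its logarithm remains an ordinary finite number. The sum node written this way still exponentiates its inputs directly, so it is only a restatement of the identity and not a stable implementation.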

Turning to our networks, the requirement that all activations be non-negative does not limit their universality. 
The reason for this is that the functions $f_\theta$ are non-negative in both cases of interest - Gaussians (eq. 7) 
and neurons (eq. 8). In addition, one can always add a common offset to all coefficient tensors ensuring 
they are positive without affecting classification. Non-negative decompositions (i.e. decompositions with all 
weights holding non-negative values) can then be found, leading all network activations to be non-negative. 
In general, non-negative tensor decompositions may be less efficient than unconstrained decompositions, as 
there are cases where a non-negative tensor supports an unconstrained decomposition that is smaller than its 
minimal non-negative decomposition. Nevertheless, as we shall soon see, these non-negative decompositions 
translate into a proven architecture, which was demonstrated to achieve comparable performance to state of 
the art convolutional networks, thus in practice the deterioration in efficiency does not seem to be significant. 

Naively implementing the CP or HT model (fig. 1 or 2 respectively) in log-space translates to log activation 
following the locally connected linear transformations (convolutions if coefficients are shared, see sec. 3.3), 
to product pooling turning into sum pooling, and to exp activation following the pooling. However, applying 
exp and log activations as just described, without proper handling of the inputs to each computational layer, 
would not result in a numerically stable computation. 
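
The naive translation just described can be sketched for a single hidden layer as follows. This is an illustration under stated assumptions, not the implementation used in our experiments: the helper names, the toy sizes (3 locations, 2 representation values, 2 channels) and the choice of shared non-negative weights are all hypothetical.

```python
import math

def conv_1x1(rep, weights):
    """Linear transform per location: out[i][z] = sum_d weights[z][d] * rep[i][d]."""
    return [[sum(w * r for w, r in zip(w_z, rep_i)) for w_z in weights]
            for rep_i in rep]

def product_pool_direct(conv):
    """Product pooling over all locations, computed directly."""
    return [math.prod(col) for col in zip(*conv)]

def product_pool_log_space(conv):
    """Same pooling as: log activation -> sum pooling -> exp activation."""
    log_conv = [[math.log(v) for v in row] for row in conv]  # requires v > 0
    pooled = [sum(col) for col in zip(*log_conv)]            # product became sum
    return [math.exp(s) for s in pooled]

# Toy, non-negative inputs (assumed values for illustration).
rep = [[0.5, 1.0], [2.0, 0.25], [1.5, 1.0]]  # 3 locations x 2 representation values
weights = [[1.0, 2.0], [0.5, 0.5]]           # 2 channels, shared across locations

conv = conv_1x1(rep, weights)
direct = product_pool_direct(conv)
via_log = product_pool_log_space(conv)
print(direct, via_log)
```

Both routes agree on small inputs. The final `exp` is where the naive scheme can still overflow or underflow when many locations are pooled, which is the instability referred to above.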

The SimNet architecture (Cohen and Shashua (2014); Cohen et al. (2016)) naturally brings forth a numerically 
stable implementation of our networks. The architecture is based on two ingredients - a flexible 
similarity measure and the MEX operator: 



The similarity layer, capable of computing both the common convolutional operator as well as weighted 
$\ell_p$ norm, may realize the representation by computing $\log f_\theta(x_i)$, whereas MEX can naturally implement 
both log-sum-exp and sum-pooling ($\lim_{\beta \to 0} \mathrm{MEX}_\beta(\mathbf{x}, \mathbf{0}) = \mathrm{mean}_j\{x_j\}$) in a numerically stable manner. 
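
A minimal sketch of how such an operator may be evaluated stably is given below. It assumes that MEX takes the form $\mathrm{MEX}_\beta(\mathbf{x}, \mathbf{b}) = \frac{1}{\beta} \log \big( \frac{1}{N} \sum_j \exp(\beta (x_j + b_j)) \big)$. This form is an assumption here, chosen to be consistent with the limit stated above, and the function name `mex` is hypothetical.

```python
import math

def mex(x, b, beta):
    """Assumed form: (1/beta) * log(mean_j exp(beta * (x_j + b_j))), beta != 0."""
    t = [beta * (xj + bj) for xj, bj in zip(x, b)]
    m = max(t)  # shift so that the largest exponent is exactly zero
    return (m + math.log(sum(math.exp(tj - m) for tj in t) / len(t))) / beta

x = [1.0, 2.0, 4.0]
zeros = [0.0] * len(x)

near_mean = mex(x, zeros, 1e-6)              # small beta: close to mean(x)
near_max = mex(x, zeros, 50.0)               # large beta: close to max(x)
lse = mex(x, zeros, 1.0) + math.log(len(x))  # beta = 1: log-sum-exp up to log N
print(near_mean, near_max, lse)
```

Under this assumed form, a small $\beta$ recovers mean (sum) pooling and $\beta = 1$ recovers log-sum-exp up to the additive constant $\log N$, so a single operator covers both cases needed in log-space.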

Not only are SimNets capable of correctly and efficiently implementing our networks, but they have 
already been demonstrated (Cohen et al. (2016)) to perform as well as state of the art convolutional networks 
on several image recognition benchmarks, and outperform them when computational resources are limited. 


10. Naive implementation of softmax is not numerically stable, as it involves storing $\alpha_i = \exp(\log \alpha_i)$ directly. This 
however can be easily corrected by defining $c := \max_i \log \alpha_i$, and computing $\log \sum_i \exp(\log \alpha_i - c) + c$. The 
result is identical, but now we only exponentiate non-positive numbers (no overflow), with at least one of these numbers 
equal to zero (no underflow). 
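
The correction described in this footnote can be sketched in a few lines of Python. The function name `log_sum_exp` and the test inputs are assumptions made for illustration.

```python
import math

def log_sum_exp(log_acts):
    """Stable log(sum_i exp(log_acts[i])) using the max-shift c := max_i log_acts[i]."""
    c = max(log_acts)
    # Every exponent is <= 0 (no overflow); the largest is exactly 0 (no underflow).
    return math.log(sum(math.exp(v - c) for v in log_acts)) + c

large = log_sum_exp([1000.0, 1000.0])    # math.exp(1000.0) alone would overflow
small = log_sum_exp([-1000.0, -1000.0])  # math.exp(-1000.0) alone rounds to 0.0
print(large, small)
```

In both cases the unshifted computation fails in double precision, whereas the shifted one returns the input value plus $\log 2$.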